<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>

<channel>
	<title>Tim Dettmers</title>
	<atom:link href="https://timdettmers.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://timdettmers.com/</link>
	<description>Making deep learning accessible.</description>
	<lastBuildDate>Tue, 27 Jan 2026 21:37:37 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.0.11</generator>

<image>
	<url>https://i0.wp.com/timdettmers.com/wp-content/uploads/2025/12/cropped-profile_2026_400x400.png?fit=32%2C32&#038;ssl=1</url>
	<title>Tim Dettmers</title>
	<link>https://timdettmers.com/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">106749684</site>	<item>
		<title>My Journey Towards Coding Agents: Building SERA</title>
		<link>https://timdettmers.com/2026/01/27/building-open-coding-agent-sera/</link>
					<comments>https://timdettmers.com/2026/01/27/building-open-coding-agent-sera/#respond</comments>
		
		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Tue, 27 Jan 2026 12:32:00 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://timdettmers.com/?p=1248</guid>

					<description><![CDATA[<p>If you look at how people cook coding agents today, they have an industrial kitchen: large-scale reinforcement learning systems with many components for efficiency spanning hundreds of GPUs, complex repository mixing, and large teams working on all angles to optimize the data pipeline for training. For the family of Open Coding Agents we released today [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2026/01/27/building-open-coding-agent-sera/">My Journey Towards Coding Agents: Building SERA</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>If you look at how people cook coding agents today, they have an industrial kitchen: large-scale reinforcement learning systems with many components for efficiency spanning hundreds of GPUs, complex repository mixing, and large teams working on all angles to optimize the data pipeline for training. For the <a href="https://allenai.org/blog/open-coding-agents">family of Open Coding Agents</a> we released today from Ai2, we had the equivalent of a hot plate and a frying pan: 32 GPUs and five bright-eyed researchers who wanted to cook state-of-the-art coding agents.</p>



<p>This blog post is about how we cooked that coding agent.</p>



<span id="more-1248"></span>



<p>Along the way, I will share the struggles and technical details that usually do not make it into papers: switching fields, disappointing people, and feeling irrelevant. And then the failures along the way when we actually tried to cook an agent with our frying pan, to the eventual breakthrough.</p>



<p>Once we had our first breakthrough, we could upgrade our frying pan to a full set of pots and an induction oven (96 GPUs). The result? A method that lets you cheaply, with just a couple of GPU days, finetune a 32B model on your own private codebase to create a coding agent that rivals or exceeds the teacher model&#8217;s performance on this private data. And our software makes it easy to deploy it in the Claude Code environment for a smooth experience.</p>





<h2>What This Post Covers</h2>



<p>This post has four parts:</p>



<p>1. <strong>The Struggle</strong>: Switching fields, the costs of starting over, and what it felt like to be in the middle of it.</p>



<p>2. <strong>Getting a Foothold</strong>: My early work on coding agents, taking an 8 billion parameter model from 0% to 24% on SWE-bench.</p>



<p>3. <strong>Data Generation Methods</strong>: Three approaches we tried: subtask splitting, end-to-end training, and finally SERA. If you want to skip to what actually worked, jump straight to the <a href="#sera-soft-verified-efficient-repository-agents">SERA section</a>. If you want to understand why we ended up there, read on.</p>



<p>4. <strong>What This Means For You</strong>: What private repo specialization can do for you, how to deploy these models, the Claude Code integration, and what becomes possible when coding agent research is cheap.</p>



<p>&#8212;</p>



<h2>The Struggle</h2>



<h3>Switching Fields is Scary</h3>



<p>I started in a good position. I had become an expert in quantization, developing methods such as QLoRA and k-bit inference scaling laws. These were hugely successful, particularly when I integrated them into the bitsandbytes library, which reached millions of monthly downloads. QLoRA became a staple in the industry. And all findings of k-bit inference scaling laws would eventually be implemented in Blackwell GPUs on the hardware level.</p>



<p>But research means extending knowledge, not just refining what you already know. While I was on the job market, I already knew two things that forced me to change direction: (1) quantization and other efficiency research were hitting diminishing returns and becoming less exciting; (2) it was clear to me that chatbot models would not yield the productivity improvements we need. So in August 2024, I made the leap, abandoned all previous work, and started working on coding agents, which I thought was the most promising direction.</p>



<p>It was scary.</p>



<p>Switching fields from one that you are well established in to something completely new is probably one of the hardest things you can do in research. People were excited that I was joining the Allen Institute for AI. I brought expertise with me. But after I joined, and people realized I was going to work on coding agents rather than something related to my previous expertise, they were surprised. Are you going to help me with efficient training and inference? Nope, I am going to learn about coding agents.</p>



<p>That felt selfish. I could really have helped my colleagues, but I chose not to, so I could focus on transitioning to coding agents and be helpful there instead.</p>



<p>That was particularly painful because I really like to help people – my main research goal remains to make the best AI available and accessible to anyone. I disappointed many people whom I liked and respected. And the costs accumulated. I had not published anything in about a year. Students in that PhD admission cycle were not excited to work with me because they saw I had produced nothing interesting. I watched as people paid less attention to me, as I became irrelevant. That is a particular kind of loneliness in research. Knowing you are working hard on something, but having nothing to show for it.</p>



<h3>Getting a Foothold: Learning From Inspecting Trajectories</h3>



<p>To get started on coding agents, I hurled myself into their depths to build intuition and understanding that would guide my future research.</p>



<p>As a proponent of open source, I naturally worked with open-source models. When I looked at SWE-bench performance for an 8B model, it was basically 0%. SWE-bench is a benchmark where you receive a GitHub issue describing a bug in a codebase and need to fix it. Closed-source models at the time performed around 30%. The 8B model had 0%.</p>



<p>My first challenge: could I improve this 8B model as much as possible and make open source competitive?</p>



<p>Being resource-limited forces you to run fewer experiments, making each one more important. My first step was to learn how to interpret trajectory data. So I downloaded trajectories from frontier models and studied them. I examined two things: the workflow models used to solve bugs, and the problems they encountered. This analysis helped me identify key patterns and strategies that I could adapt to improve open-source models.</p>



<p>Every closed-source coding model learns the same workflow, one that closely mirrors what humans do. They read the GitHub issue and try to understand where the related functions and classes are in the codebase. This leads to “rabbitholing” (as Ethan puts it), where a model goes both broad and sometimes deep into the code to find the bug and understand it thoroughly by examining how functions relate to other components. This builds the understanding that the model needs for the next steps.</p>



<p>The next steps can be interchanged: write a bug fix, then a replication script, or vice versa. Technically, writing a replication script first to verify the bug is better, but closed-source models sometimes interchange these steps. Then comes an iteration stage where editing continues until the replication no longer throws an error. Finally, you create a patch and submit it.</p>



<h3>The Copy Problem</h3>



<p>Looking at what was going wrong with small open-weight models, I found the core problem: they are very bad at copying information. This is critical for tool calling. If you want to explore a codebase, you need to reference function names exactly. These tool calling errors often confused small open-source models, where they would generate a reasoning trace of “This does not work. Let me try something else”. These small imprecisions stopped small open-weight models from getting any solutions at all.</p>



<p>The fix was straightforward: do a fuzzy search against existing functions so tool calls are made against the right function (instead of non-existing ones). This immediately led to an improvement from 0% to 7% solve rate. From there, I could run more trajectories and see other failures. The model sometimes failed to produce the closing tag required for tools, leading to no tool call and confusion that resulted in errors. Ensuring tags are always closed for tool calls pushed performance to 14%.</p>
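<p>The post does not spell out the mechanism, but the idea can be sketched with Python&#8217;s standard <code>difflib</code>: when a tool call references a function that does not exist, snap it to the closest real name in the codebase. The function names below are hypothetical, purely for illustration:</p>

```python
import difflib

def repair_function_name(requested, known_functions, cutoff=0.6):
    """Map a possibly misspelled function name from a tool call to the
    closest real function in the codebase, or None if nothing is close."""
    matches = difflib.get_close_matches(requested, known_functions, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# A small open model hallucinates "get_confg"; fuzzy search recovers the
# real symbol instead of letting the tool call fail.
print(repair_function_name("get_confg", ["get_config", "set_config", "load_model"]))
```

<p>The key design point is the <code>cutoff</code>: too low and unrelated functions get silently substituted, too high and genuine near-misses still fail.</p>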



<p>The big lesson: open-weight models were not good at copying information, while closed-source models excelled at it. Armed with this knowledge, I made further changes to make smaller models more robust when referring to symbolic information.</p>



<p>I pushed the 8 billion model to 24%, very close to GPT-4o at 25% for the scaffolds of that time. These quick improvements were mostly driven by my experimental methodology. I carefully estimated the variance in results so I could, with precision, say what works and what is noise. This was something I learned from writing the QLoRA paper, where I ran thousands of small experiments to verify which variables mattered at a larger scale.</p>



<p>This was exciting. I felt I was getting a foothold in coding agents. By November 2024, people at AI2 were interested in applying these methods to their agent workflows. Nobody at AI2 was working on coding agents specifically, but people were working on scientific agents, so I shifted to building a general agent framework. Then came the health problems, and the pause.</p>



<h3>The Setback</h3>



<p>When I was starting to make progress, misfortune struck. I developed health problems that forced me to take a break. While that was a difficult decision, I stopped working in February 2025.</p>



<p>I had committed to hiring an intern, Ethan Shen, who was highly recommended with an impressive background. Even though I could not work and program myself, I wanted to mentor him. Ethan started one day before Claude Code was released. He had not worked on coding agents before.</p>



<p>But Ethan learned quickly. Remarkably quickly. Together with my mentoring, he made rapid progress. A couple of months into the coding agents project, he would reach out to other researchers about coding agents, and it turned out he often had more insights into what determines good performance than they did. He became one of the most knowledgeable people in the coding agent space. I like to think I had a role in Ethan&#8217;s growth, but really, Ethan is just exceptionally good.</p>



<p>When Ethan and I achieved our first state-of-the-art results with the subtask-splitting method, it proved that all the struggles and setbacks were worth it. This progress can inspire you to keep pushing through your own challenges in research.</p>



<h2>Data Generation Methods</h2>



<p>When I fully returned to work after my health-related break, the landscape had shifted. Qwen 3 had been released with improved reasoning capabilities. Open-source models now had the precision to make mostly correct tool calls. The bottleneck was no longer scaffolding. It was data.</p>



<p>Working with real training data from real GitHub issues was not scalable. Real data is difficult to work with, particularly when the only person on the project is new to coding agents. So the choice was obvious: the only scalable solution was generating synthetic data.</p>



<p>At the time, most people wanted to generate correct training data: find GitHub issues with bugs, which gives you the final solution (the patch), the faulty state (the bug), and the repository. Then generate synthetic training data from a larger model&#8217;s rollout, which you use to fine-tune a smaller model.</p>



<p>For efficient synthetic data, you need a different approach. The core problem is generating labels. A simple approach: start with the label, then mutate your data to generate the example. This means: start with a correct codebase, generate a synthetic bug, then use the bug as the input and the correct codebase as the target. This makes scaling easy.</p>



<p>But here is where the resource constraint bit hard. Coding agents usually need massive amounts of compute. And we only had 32 GPUs at that time. This is the same problem I faced with QLoRA: you need resources to learn to be resourceful, but you do not have them yet. Eventually, it becomes a virtuous cycle, a flywheel. But first, you need to get it spinning by reaching a level of efficiency that makes your resources sufficient for further progress.</p>



<p>We tried three approaches. Each one taught us something that led to the next.</p>



<h3>Approach 1: Subtask Splitting</h3>



<p>Code search is one of the most important problems in coding agents. If you cannot find the bug, you cannot solve it. If you find it, you have a good chance. So the most important thing to learn from a trajectory is searching the codebase. Why not learn that first, then learn editing afterwards?</p>



<p>This has an advantage: subtask methods might transfer across problems. Searching a codebase is useful for many digital problems. Editing is used to add features, fix bugs, and more. By splitting into subtasks, we hoped to gain efficiency that spreads to many subtasks. Doing well on search first would make learning to edit more efficient, since we could already find bugs with higher precision.</p>



<p>The problem: you have no labels for subtasks. But you can use a mutation method. Start with a correct codebase, mutate it into a problem state, then use the original state as the label and the mutated state as input. This generates countless tasks for learning to search. We refer to this problem as the proxy task problem: how to find a task that indirectly helps you learn what you care about.</p>



<h4>Finding Good Proxy Tasks</h4>



<p>How do you find which task is good for generating data? There is a simple, effective method. Take multiple models, generate a proxy task via mutation, evaluate them on both the proxy task and SWE-bench, and then compute the correlation.</p>



<p>A relevant task shows a correlation close to 1: better proxy task performance means better SWE-bench performance. A bad proxy task shows low correlation.</p>



<p>The best search proxy task we found is the following: take a function, have a language model summarize it, then create a task where the model gets the summary and must search the codebase to find that function. This task has almost a one-to-one correlation with SWE-bench performance, showing that search is critical, and you can easily create tasks that mirror it.</p>



<p>Fine-tuning on just 500 samples from this task makes 8 billion models almost as good as frontier models at searching codebases. This heavily boosts open-weight model performance and easily reaches state-of-the-art results (at that time). This was our first state-of-the-art result.</p>



<p>There was a thought to publish here, but I have a tendency to push further. My thought is this: a bad paper has no impact, a good paper has some impact, a great paper has a massive impact. If you have a good result, why publish? Go for the great result instead. What if you get scooped while you try to get a great result? Not too bad. You quickly publish, and you still have a good result.</p>



<p>Ethan saw it the same way and we pushed on.</p>



<h4>The Editing Step</h4>



<p>With the search solved, the second step was editing once you found the function.</p>



<p>Our method: take a correct codebase, insert a bug, describe it in a GitHub issue like SWE-bench, and provide the altered codebase with the bug plus a search trace from the first subtask. The model uses the search context pointing to the correct function and learns to edit it.</p>



<p>Around this time, John Yang and some close friends released SWE-Smith, using the same method. The difference: they used unit tests to verify that generated bugs actually break things. We did not have resources for that infrastructure.</p>



<p>Despite unverified bugs, we got good results. However, while search needs only 500 examples for near-frontier performance, editing is different. It requires precise updates, so we needed many samples with many different bugs.</p>



<p>The procedure was efficient, as we could append the search trace and reproduce the bug. But we realized that the editing model was as computationally expensive as an end-to-end approach that combined both steps. And when we ran an end-to-end experiment, we got results nearly as good as our split task generation with the only difference being that end-to-end is much simpler. So we decided to switch to end-to-end generation and training.</p>



<h3>Approach 2: End-to-End Generation</h3>



<p>End-to-end generation is essentially something like SWE-smith: (1) mutate a correct codebase to get a bug; (2) create a GitHub issue; (3) generate rollouts from a model; (4) train on the data (GitHub issue, mutated codebase, trajectory).</p>



<p>Generating lots of data is expensive, so we settled on not running tests. We had basically only Ethan working on the project at this point, and full verification would not have been feasible. So after first trying hard verification of our data, we settled on <em>soft verification</em>.</p>



<p>Soft verification means we trained on data that could be incorrect. We only compare the generated patch with the mutated patch, looking for partial line-by-line overlap. If the model generates a patch that overlaps with the target patch by 50%, we accept it as training data. This reduced costs significantly, since running tests consumes time and CPU cycles.</p>
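<p>A minimal sketch of such a soft verifier, assuming a simple changed-line overlap metric over unified diffs (the exact definition we used is not pinned down here):</p>

```python
def changed_lines(patch):
    """Extract added/removed lines from a unified diff, skipping file headers."""
    return {line for line in patch.splitlines()
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))}

def soft_verify(generated_patch, target_patch, threshold=0.5):
    """Accept a rollout as training data if the generated patch covers at
    least `threshold` of the target patch's changed lines."""
    gen, tgt = changed_lines(generated_patch), changed_lines(target_patch)
    if not tgt:
        return False
    return len(gen & tgt) / len(tgt) >= threshold
```

<p>No repository checkout, no test execution – just string comparison, which is why this scales on a handful of GPUs plus ordinary CPUs.</p>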



<p>We also needed a cheap model for data generation. GLM 4.5-Air and 4.6 give strong SWE-bench results and are open-weight, but the full GLM 4.6 is too big for the 32 GPUs that we needed for generation, training, and evaluation. The GLM 4.5-Air model hit the sweet spot: strong performance at low cost on a few GPUs. This gave us low verification complexity and efficient generation.</p>



<p>Cost-effectiveness spins up a flywheel: more efficient resource use leads to faster experimentation, which leads to faster progress.</p>



<p>Since we settled on no hard verification, our method was feasible with the 32 GPUs we had at the time.</p>



<p>But doing this revealed something unexpected.</p>



<h3>Approach 3: Soft-verified Generation – Embracing Vagueness</h3>



<p>When generating bugs, you need to give the model a hint of what the bug is without giving it away. This is not easy: you often have only a function as a starting point and a vague description of the bug. What we did initially was analyze the call graph of a codebase, then find functions that form a chain func A -&gt; func B -&gt; func C. Then we would prompt the model with “There is a bug downstream from A”. With this vague description, the model often produces all kinds of different “bug fixes”. A bug from function reuse might lead to a solution that looks more like refactoring than a bug fix. We got refactoring traces even when asking for bug fixes.</p>



<p>At first, this seemed like a problem. But at this point, Saurabh Shah joined the project. He built reinforcement learning (RL) infrastructure and trained a lot with RL. He told us that this is actually pretty common for RL too: training on incorrect data often gives good performance.</p>



<p>So we doubled down: instead of trying to fix this incorrect data, we embraced it, and that led to something unintuitive.</p>



<p>Instead of trying to generate training code that is as correct as possible, we do something simpler – and something slightly wacky. Take a correct codebase, select a random function, and tell the model there is a &#8220;bug&#8221; downstream from that function (even if there is no downstream function).&nbsp;</p>



<p>To prompt a model with a correct codebase and “There is a bug somewhere here. Please find it and fix it” is ridiculous – there is no bug! But when you realize that most of the generations are never about bugs anyway, this is more sensible. And the other side of this is that it is very cheap, and very easy to do. It is just prompt + function.</p>



<p>The model explores from that function, looks at the code, and imagines a bug that does not actually exist. It might uncover a missing edge case, a missing assertion, inaccurate documentation, poor code that needs to be refactored, or a &#8220;real bug&#8221;. As such, while instructed to fix a bug, the model creates many other sensible improvements to the already correct code.</p>



<p>To keep generation cheap, we continue the pipeline as usual in the second rollout: generate a GitHub issue from that first rollout and its patch, then do another rollout starting from the GitHub issue and the correct codebase to finally create a second patch. This leads us to two trajectories: one with a GitHub issue and another without.</p>



<p>This generates lots of training data from just random functions in a codebase. We compare the overlap between the two patches with soft verification. Two trajectories in succession make everything compute-efficient and simple. All you need is a codebase with functions and a model that can generate trajectories through two calls.</p>
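<p>Putting the pieces together, the two-rollout loop can be sketched as follows. Here <code>rollout</code>, <code>make_issue</code>, and <code>soft_verify</code> are hypothetical stand-ins for the agent rollout, the issue-writing model call, and the patch-overlap check; the exact interfaces are assumptions for illustration:</p>

```python
def sera_generate(function_name, rollout, make_issue, soft_verify):
    """Sketch of the two-rollout SERA data generation loop for one function."""
    # Rollout 1: tell the model a "bug" lurks downstream of the chosen
    # function, even though the codebase is actually correct.
    prompt = f"There is a bug downstream from {function_name}. Find and fix it."
    traj1, patch1 = rollout(prompt)

    # Turn the first trajectory and its patch into a synthetic GitHub issue.
    issue = make_issue(traj1, patch1)

    # Rollout 2: start fresh from the issue and the correct codebase.
    traj2, patch2 = rollout(issue)

    # Keep the pair only if the two patches agree enough (soft verification).
    if soft_verify(patch1, patch2):
        return {"issue": issue, "trajectories": [traj1, traj2]}
    return None
```

<p>Two model calls per training example, no test harness: that is the whole pipeline, which is what makes generating thousands of trajectories per repository cheap.</p>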



<p>&#8212;</p>



<h2>SERA: Soft Verified Efficient Repository Agents</h2>



<p>This approach gives us a much higher data limit. Typical data generation pipelines are limited by how many sensible bugs you can create. Verifying bugs to actually cause tests to fail further limits this. You also need a starting point that does not reveal the bug location. This means many other synthetic data approaches can only generate a very limited amount of data from a single repository.</p>



<p>We have none of these limitations. From a codebase with a thousand functions, we can generate a thousand trajectories. To amplify further, we examined papers analyzing common engineering bugs, getting a distribution of 51 bug types.</p>



<p>For each function, we can generate 51 different bugs. A codebase with 1,000 functions yields 51,000 trajectories easily and cheaply. We are essentially not data-limited – even for a single repository.</p>



<p>This result leads to a core advantage of SERA: private codebase specialization.</p>



<p>Closed-source models do well on popular programming languages and common patterns. But on less-represented languages or proprietary codebases, they struggle. The holy grail is to specialize a model by training it on private data without exposing that data to a provider.</p>



<p>We generate massive amounts of data for a particular repository. Generating 7,000 trajectories takes just 19 GPU days. Training on these trajectories gives us a 32 billion model as good as the teacher, GLM 4.5-Air, the model from which we generated data. This works for other teacher models too, meaning you can compress frontier-level performance specialized to private data into a small, easily deployable model.</p>



<p>With enough data, we exceed the performance of the teacher model. This makes sense: we generate data from private repositories that the teacher has never seen. The student can exceed the teacher through specialized data.</p>



<p>This is a massive result. It helps companies quickly specialize a small model to frontier-level performance, or exceed it, on their private data.</p>



<h2>Simple to Use Infrastructure Including Claude Code Integration</h2>



<p>Ethan, Saurabh, and I knew we had something special, but we wanted to have a nice release and we were thin in terms of building that release. So Danny Tormoen joined us – and boy, did he do a good job!</p>



<p>Danny very quickly developed a converter that adapts the SWE-agent tool calls to be compatible with the Anthropic API and converts responses back to what SWE-agent understands. Danny also created a deployment script for Modal that you can run without a credit card. You just need to create an account and you can try our agent in Claude Code for a couple of hours. All these tools are usable with 1-2 lines of code or commands. The same is true for our training and data generation pipeline.</p>



<p>With this, we could quickly start dogfooding our model in Claude Code. Initial impressions are “a pretty good model for its size,” but the model also has a habit of wanting to submit patches after some iterations – a clear artifact of the training procedure.</p>



<h2>What This Means For You</h2>



<p>With SERA you can easily and cheaply train high-performing coding agents on your personal or proprietary repositories and deploy these agents seamlessly in a Claude Code interface – you can deploy with a single line of code.</p>



<p>A 32 billion model will not give you frontier performance right now. But it gives you a strong, specialized model for your private code, and you do not need to send your data to any provider.</p>



<p>More importantly, by tinkering with this, you gain skills. Once more advanced methods become available, you will be ready to use them. The trajectory of this field suggests that matching or exceeding frontier-level performance on private data is within reach. And SERA is a very exciting result in this direction.</p>



<p>I am very excited to see what you will build and do with SERA &#8212; the first release of Ai2&#8217;s Open Coding Agents family &#8212; and specialized coding agents for your data.</p>



<h2>Conclusion</h2>



<p>Because our method is cheap, it opens coding agent research to everyone. You do not need large teams to maintain complex reinforcement learning setups. You do not need thousands of dollars to replicate a baseline.</p>



<p>Our baseline, which beats the previous state-of-the-art open-source method, costs $500 to run. Any lab with a few GPUs can now do coding agent research.</p>



<p>I would not be surprised if academic labs reproduce frontier-level coding agent performance for private coding data by the end of 2026. Open-weight models are already approaching that level, and I think we can shrink everything down to 100B or even 32B models that are super easy to deploy and use. With our method and future methods, you can specialize on your private data without exposing it.</p>



<p>There is a real promise here: open-source models that exceed frontier performance because you use your private data, train your own model, and deploy it yourself.</p>



<p>This is how we cooked our coding agent with a hot plate and a frying pan. The industrial kitchen is nice, but it is not required. Sometimes constraints force you to find simpler solutions – and sometimes those solutions turn out to be better.</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2026/01/27/building-open-coding-agent-sera/">My Journey Towards Coding Agents: Building SERA</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://timdettmers.com/2026/01/27/building-open-coding-agent-sera/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1248</post-id>	</item>
		<item>
		<title>Use Agents or Be Left Behind? A Personal Guide to Automating Your Own Work</title>
		<link>https://timdettmers.com/2026/01/13/use-agents-or-be-left-behind/</link>
					<comments>https://timdettmers.com/2026/01/13/use-agents-or-be-left-behind/#respond</comments>
		
		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Tue, 13 Jan 2026 12:56:37 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://timdettmers.com/?p=1238</guid>

					<description><![CDATA[<p>If you are reading this, you probably feel the FOMO. Maybe you have seen the Twitter threads about coding agents completing entire features in minutes. Maybe a colleague mentioned they are &#8220;10x more productive&#8221; now &#8212; or “Influencers&#8221; saying AGI is here and you need to learn their particular thing now. Maybe you tried Claude [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2026/01/13/use-agents-or-be-left-behind/">Use Agents or Be Left Behind? A Personal Guide to Automating Your Own Work</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>If you are reading this, you probably feel the FOMO. Maybe you have seen the Twitter threads about coding agents completing entire features in minutes. Maybe a colleague mentioned they are &#8220;10x more productive&#8221; now &#8212; or &#8220;influencers&#8221; saying AGI is here and you need to learn their particular thing now. Maybe you tried Claude Code and felt confused about why the magic everyone talks about is not working for you. This blog post is for those who want to cut through the hype and understand what actually works, what does not, and how to think about using agents to automate your own job further and further to be more productive.</p>



<p>I have been using agents — primarily Claude Code — for eight months to automate my own work. What you will read here is not speculation or theory. It is the product of hundreds of hours of experimentation, many failures, and some surprising successes. As a professor who does not write much code anymore, my perspective is different from the software engineering discourse that dominates Twitter. Most of my agent use is actually for writing — blog posts, grant proposals, meta reviews. While these problems might be non-traditional, they show exactly how coding agents can be used for all kinds of tasks beyond coding itself, and they help you understand how far you can go in every direction of agent use.</p>



<span id="more-1238"></span>



<p>Just to give you a hint of how powerful agents have been for me: usually, the first year as a professor is very stressful and involves a lot of work. For me, it felt easy. I had some luck here and there, but I believe the use of agents is a significant reason why things have been manageable for me when they are hard for others.</p>



<p>This blog post is my attempt to share what I have learned so that it might help you grow in the long term. I will detail a number of different things I have built with agents — which succeeded and which failed. There are plenty of blog posts out there about software engineering with agents; this one tries to give a broader, more balanced view.</p>



<p>Before I worked in AI, I spent three years in the automation industry in Germany, developing SCADA systems. That experience taught me how to think about automation systematically — when it makes sense, when it does not, and how to build skills over time. I bring that perspective here, along with concepts like process optimization that are standard in manufacturing but rarely discussed in the context of AI agents.</p>





<h2>What Is Hype and What Is Real</h2>



<p>This blog post is mostly about my own experience with concrete examples of successes and failure cases. This is <em>real</em>. And I want to contrast this slightly with the Twitter discourse, which can create unnecessary FOMO.</p>



<h3>Hype: Vast parallelization, large productivity increases, and autonomy</h3>



<p>What you see on Twitter is mostly about software engineering. And while agents in software engineering are real and powerful, software engineering is very unlike most other problems.</p>



<p>Firstly, in software engineering, you often have many parallel problems that you work on independently: bugs, new features, quality control and refactoring, GitHub discussions, and reviews. All of these tasks are independent and can be parallelized. The larger the codebase, the more independent work can be done, and the more parallel sessions become useful.</p>



<p>But here is the thing: the concept of parallel sessions is an agentic workflow pattern that is useful for software engineering and some other tasks, but most tasks cannot benefit from parallelization.</p>



<p>Secondly, while productivity gains in software engineering are real, they do not automatically translate to all tasks. Coding is a very general capability. Theoretically, you can do anything digital – which spans a lot of tasks. But in practice, automation of many other non-software-engineering tasks is very difficult or has small payoffs. Later in this blog post, I will give you a framework for how to think about this more broadly.</p>



<p>Thirdly, while a fully autonomous system can be impressive, useful real work constantly surfaces design decisions. Iteratively designing a system in shorter bursts of autonomy, with feedback loops in which you shape the final solution you want, can be much more effective than just rolling out agents until a solution is achieved. Full autonomy works, but it is not very helpful for most work because the quality is too low.</p>



<h3>Real: Agents Should Be Used Everywhere</h3>



<p>This blog post is about using coding agents for all kinds of tasks and how I learned from that experience. After 8 months of Claude Code use and trying to automate countless tasks, here is my honest assessment: more than 90% of code and text should be written by agents. You need to do so, or you will be left behind.</p>



<p>This statement might seem controversial and as FOMO-inducing as the software engineering story I just critiqued. But I believe it is reality, and understanding how to adjust to that reality will be a big part of everyone&#8217;s jobs going forward. This blog post is an attempt to share my knowledge to help you on this path.</p>



<p>When I talk with people about this, a lot of people push back vehemently. Generate all your work with AI? They find it ridiculous. How can generic boilerplate generation replace the intricate style of a well-designed software system? It feels absurd for them to replace the immediately noticeable and distinct style of a writer with an AI-generated slop wall of text.</p>



<h2>Why AI-Generated Content Is Personal, Not Generic</h2>



<p>AI-generated content is personal content. When I explore a concept with Claude, the output is not generic. It is shaped entirely by my thinking, my style of conversation.</p>



<p>I really like connections between fields, and the topics I explore are highly personal and unique traces of my thinking — and with that, my taste.</p>



<p>Let me give you a vivid example. I once started a conversation about jihad — the often-misunderstood concept in Islamic theology of the inner struggle to do the right thing when it is difficult but truly matters. From there, I ended up connecting it to Krishna&#8217;s advice to Arjuna to do the right thing and not worry about the outcome by surrendering the fruits of action to him (karma yoga), to Lutheran grace that is purest when it emerges at the height of struggle, to Taoist wu wei, where struggle disappears through letting go and letting your nature take over, to Beowulfian naked will against overwhelming odds — the struggle with Grendel as a symbol of surrendering to your fate.</p>



<p>None of this exists in any textbook or on the internet. It is a fingerprint. A very personal fingerprint. If you were to read the details of these conversations, you would know parts of me intimately — who I am and why I am that way. You would know me to a degree that is usually only reserved for close friends and your partner.</p>



<p>Someone who thinks that AI-generated content is impersonal and generic is deeply mistaken. The concept of soulless AI generation is an artifact of less powerful AI, or of mistaking your own poor generations for the limit of what AI can do, rather than recognizing them as a skill issue that has to be overcome.</p>



<h2>Useful Background: How to Think About Automation</h2>



<h3>The Basic Calculus of Automation</h3>



<p>Before I worked in AI, I worked in the automation industry in Germany. I was developing SCADA systems — integrating data from machines and databases to enable the control of workflows via data and monitoring. The knowledge gained in these three years in the automation industry applies directly to automating your own work with agents.</p>



<p>The first important question is: when should you automate versus when should you not? While people always think about automation in terms of full automation, this is almost never the case in practice. You always have a degree of automation.</p>



<p>Here is how to think about useful automation: if you take your current degree of automation and increase it by a new technology, then you improve by a certain percentage. If a task takes 10 hours and you improve the degree of automation by 10%, then it takes 9 hours.</p>



<p>With this view, there is a simple calculus: how often do you do the task, and how long do you need to automate this task to improve the degree of automation by 10%? If this calculus leads to a result where the cost of automating something is higher than the gain, then the problem is not fit for automation. The task should be done manually. There are many tasks that should not be automated because it is not effective. I will give a lengthy example about email automation, where I tried hard, but where automation ultimately fails.</p>



<p>Additionally, changing your workflow adds overhead. For example, an automation might save you 30 seconds, but if your agent needs 30 seconds to generate that automation each time, then the effectiveness is 0%. The degree of automation improves by 0%.</p>



<p>If you invest so much time into improving your work with agents, you want to make sure that it actually helps. This simple calculus – while not perfect – is a simple tool to help you decide where to start with automating your work.</p>
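<p>As a quick illustration of this calculus, here is a minimal sketch (the numbers are illustrative, and <code>automation_payoff</code> is a name I chose for this post, not an existing tool):</p>

```python
def automation_payoff(task_minutes, runs, automation_fraction,
                      build_hours, per_run_overhead_minutes=0.0):
    """Net minutes saved by raising the degree of automation of a recurring task.

    task_minutes: how long one manual run of the task takes
    runs: how many times you expect to do the task
    automation_fraction: e.g. 0.10 for a 10% higher degree of automation
    build_hours: one-time cost of building the automation
    per_run_overhead_minutes: new overhead the changed workflow adds per run
    """
    saved_per_run = task_minutes * automation_fraction - per_run_overhead_minutes
    return saved_per_run * runs - build_hours * 60

# A 10-hour task done monthly for a year, automated 10% at a cost of 8 build-hours:
net = automation_payoff(task_minutes=600, runs=12,
                        automation_fraction=0.10, build_hours=8)  # 240 minutes net
```

<p>If the result is negative, or zero as in the 30-seconds-saved, 30-seconds-of-overhead case, the task is better done manually.</p>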



<h3>The Method: Process Optimization</h3>



<p>A very basic method in factory automation is process optimization. You are on the factory floor with a stopwatch. You look at and study what workers are doing. You time each step: how they hand off work to another person, when they wait for previous work to complete, how many resources they need, and if they wait for resources.&nbsp;</p>



<p>If you have all these components, you can construct a workflow — a process that reflects the current state of how work is done. Based on this, you can think about how to optimize. This thinking is extremely important for automating your own work. You have to think about how you work on a particular problem and how you can change your work with agents. Using this framework of process optimization can be extremely helpful to get a quick sense of how much productivity gains you can achieve with a particular solution. Sometimes you find that the process cannot be optimized much with agents – that saves you a lot of time.</p>



<p>Let me give you a concrete example. If I take one minute to read an email and 30 seconds to reply, an email takes 1 minute 30 seconds to complete. Now, if I use an agent to help with my emails, I need to guide it through processing them, read its output to decide how it should create drafts or answer emails, and then edit those drafts. Once you do this exercise, you realize that agents just shift the process: some generation happens automatically, but you still need to read the content, make sure it is aligned with your intent, and edit the draft if it does not match exactly what you wanted.</p>



<p>There are certain emails that are easy to automate. There are others that are not. It depends on your process and your underlying inputs to see if using agents and changing your process can actually lead to productivity.</p>



<p>Reading an email or reading AI-generated content has a cost. You need to include that cost in your process optimization thinking to understand if your process can benefit from automation. This insight — which seems obvious but is often ignored — is fundamental to achieving higher and higher degrees of automation.</p>



<h3>The Long-Term Perspective: Building Automation Muscles</h3>



<p>The process perspective I just gave is a short-term view. You look at the underlying processes and the degree of automation, then think about how long you will need to automate that work and how much you increase the degree of automation. It is a simple calculus. This is classic automation as it is done in Germany and other countries: very cost-effective and optimal in the short term.</p>



<p>However, it is very short-sighted because it does not consider long-term consequences.</p>



<p>The long-term view is a Shenzhen-style perspective. It is not about making any automation useful in the short term. It is about making automation useful in the long run by gathering knowledge that improves automation over time.</p>



<p>It is essentially short-term calculus with a meta-automation step added: even if the degree of automation is not worth it in the short term, will the skills I build and the tools I develop make future automation effective that was previously ineffective? Does the additional knowledge help me with future automation effectiveness?&nbsp;</p>



<p>This is exactly what led from Shenzhen-style scrappy factories to highly structured dark factories that are fully automated. Chinese automation is far superior to Western automation, not because of scale, but because the long-term view of automation led to a higher quality and degree of automation.</p>



<p>This is an important concept. You need to optimize both short-term and long-term perspectives to effectively automate your own job. Europe is struggling because of its short-term view of automation. The US is struggling in many segments because it did not build the long-term skillset that is required to build the automation muscles to tackle more challenging problems.</p>



<p>In other words, using agents and failing at automating a task effectively is important. You need to gain skills to improve future automation, and that means sometimes trying to automate things that you know you will not be able to automate.</p>



<p>Making sure you learn over time is a key part of this. Often, you learn more from failures than from successes, and with agents, it is no different.</p>



<h3>Why Automating Your Job Is Good for You</h3>



<p>Software engineers are not replaceable. They just level up. The current hiring challenges are driven by COVID and financial dynamics much more than by AI. Software engineers are now much more effective at building software more rapidly, and the value of software has not decreased significantly. An engineer who uses agents like a hot knife slicing through butter is actually more valuable because they can produce more software that still has significant value.</p>



<p>A common view, particularly from the Bay Area, is that this is the current state, but software engineering will be fully automated very soon. I have many friends at frontier labs who had this view about nine months ago, but it has broadly changed. They see that it is difficult to automate their own work and that, as they use these tools, new problems open up.</p>



<p>Even if an agent can do everything, it cannot do everything at the same time. If you have a limited amount of GPUs, you want to direct agents to tasks that are useful for you so they can generate value where you need it. While even that can be partially automated, once your existence is at stake, you probably want to direct what agents do yourself — at least specify the problem and solution you want.</p>



<p>I think it will be a long time until you use an agent to manage your retirement savings by analyzing the stock market and optimizing it fully autonomously. But what is more reasonable is that you build a system where you tell an agent what risk you are happy to accept and how to optimize this risk through hedging, so that you might manage your retirement fund with a trade-off between potential upside and risk over time. It would be unwise to fully trust an agent if you do not know the parameters that are important for you and how the agent chooses those parameters.</p>



<p>If resources are limited, you want to decide how those resources are used rather than fully trusting an agent. And if this is true, then directing agents will remain a problem, even if agents can do everything, because agents cannot do everything at once, because resources are finite.</p>



<p>Long story short, because of this resource problem, there will always be work where your personal preferences, decisions, and taste will be needed — even if 90% of the work happens through AI. From software engineering, we already see that these changes work, but they will not eliminate many jobs that we thought would be automated away quickly.&nbsp;</p>



<p>I think the other direction is actually more pressing: if you do not know how to use agents, you will not have a good job or be able to find a job. Agent use is now an essential skill that you need to develop and master.</p>



<h2>My Personal Experience with Automating My Own Work</h2>



<h3>Personal tools and pipelines</h3>



<p>What is most common on Twitter are examples of successful agent use, where people create a tool that is useful for them. Small extensions that help your everyday life — just vibe coding something that you always wanted and that is simple, but nobody provided.</p>



<p>While this is a simplistic way of using agents, it has its importance. This is a problem where agents work really well, and they require very little skill to be used correctly.</p>



<p>For example, I built tools that help me write this blog post. I built tools that help me work with agents. One of the most important tools is a voice tool, which helps me quickly interact with agents, particularly for parallel sessions. A voice tool also helps me because I have carpal tunnel in both my hands. Typing can be painful. I have a very custom keyboard layout and way of working with the keyboard that reduces pain to almost zero, but still, it is much more comfortable to just use my voice. And it is not only comfortable, it is also faster.</p>



<p>A main advantage is that with voice, you can inspect outputs and use your keyboard and mouse while narrating. This is extremely powerful. A key tool that everybody should develop is their own voice tool to use AI in this way, where they can do work while narrating.</p>



<h3>Tools for Students</h3>



<h4>Finding related papers: Replication of Connected Paper</h4>



<p>Another tool I built was to solve the problem of finding related work. The most useful tool I have ever used for this was <a href="https://www.connectedpapers.com/">Connected Papers</a>. It was free at one time, but then it became commercial. I need something like this at the beginning of a project and when writing the related work section of a paper. I did not want to pay for the subscription, and I wanted my students to be able to access it. So I just replicated the entire software system.</p>



<p>This was probably not effective automation in the short term — I could have just paid for a Connected Papers subscription. But it gave me an overview of what I can do in the long term: what tools I can build, what is too ambitious, what is less ambitious, and how I can be more effective when creating complex tools.</p>



<p>My Connected Papers replication uses the Semantic Scholar API to retrieve data. It then computes statistics on the citation graph to find papers very similar to those Connected Papers finds. The key insight I had is how Connected Papers works: it finds papers that are in conversation with each other. Such papers are often indirectly related through a third paper that cites both of them, closing a loop — a chain of three papers that forms a circle. If you count how often such three-paper loops connect a candidate paper to your starting paper, you have a very good way of finding related papers.</p>
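<p>A minimal sketch of this loop-counting idea on a toy citation graph (this is my simplified reading of the approach, not Connected Papers&#8217; actual algorithm, and the function name is mine; real data would come from the Semantic Scholar API):</p>

```python
from collections import Counter

def related_by_loops(seed, cites, cited_by):
    """Score candidate papers by how many three-paper loops they share with
    the seed paper: the seed and the candidate are linked through a third
    paper, closing a triangle in the citation graph.

    cites: dict mapping a paper to the set of papers it cites
    cited_by: dict mapping a paper to the set of papers citing it
    """
    scores = Counter()
    # Co-citation loop: a third paper cites both the seed and the candidate.
    for third in cited_by.get(seed, set()):
        for candidate in cites.get(third, set()):
            if candidate != seed:
                scores[candidate] += 1
    # Bibliographic-coupling loop: seed and candidate cite the same third paper.
    for third in cites.get(seed, set()):
        for candidate in cited_by.get(third, set()):
            if candidate != seed:
                scores[candidate] += 1
    return scores
```

<p>Counting both directions captures co-citation (a third paper cites both) and bibliographic coupling (both cite the same third paper), the two ways a three-paper loop can close around a pair of papers.</p>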



<p>The tool I created was very useful, but here is where it failed: the user interface. The algorithm works well; making the software easy to use for others turned out to be the hard part. My students needed to execute a Python command and use a password to extract an API key; it was a mess to get started. Instead of local deployment, I should have built a regular website that you can access anywhere with just your browser.</p>



<p>If you want to create a tool that improves the degree of automation of a task, a useful tool is often not enough. You need to figure out how you and other people can use these tools intuitively and effectively.</p>



<p>You see that even creating a simple tool like this Connected Papers replication has its own complexity. But such failure cases give you perspective: while not highly successful in the short term, they give you the skills needed to tackle problems more effectively in the future.</p>



<p>I would encourage you to spend some time on projects that do not offer a high gain in terms of degree of automation, just for the sake of having more diverse points of failure that you encounter, which will inform your future automations.</p>



<h4>Exploiting coding agents as an API</h4>



<p>Other tools I built are mostly for my students. It was recently revealed that quite a bit of Claude Code use actually consists of exploiting it as an API for regular LLM calls: you use a Claude Code endpoint just as an API in other work. While I do not use Anthropic for this, there are other providers where you can get frontier capabilities at about 1% of the usual API cost. So you get regular API calls, but at 1% of the price.</p>



<p>I built this pipeline for my students a couple of months ago, and when I asked my students if they needed any GPUs for their projects, they said no — they just generate evaluations and research directions with the API that I created. It has been a very useful tool. For a research group, having easy access to frontier model capabilities without the cost that is typical for APIs is liberating for research. And I built this tool in about 2 hours. This is where good tooling really paid off.</p>
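<p>One way such a pipeline can be wired up is to shell out to the agent&#8217;s non-interactive mode and treat it like a plain completion endpoint. A minimal sketch, assuming the CLI prints the model&#8217;s answer to stdout when given a print-mode flag (Claude Code supports <code>claude -p</code>; other agents use different flags, and the function name here is mine):</p>

```python
import subprocess

def llm_call(prompt, cli="claude", timeout=120):
    """Use a coding-agent CLI as if it were a plain LLM API call.

    Assumes the CLI has a non-interactive print mode (e.g. `claude -p "..."`)
    that writes the answer to stdout; adjust the flag for other agents.
    """
    result = subprocess.run(
        [cli, "-p", prompt],
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.strip()
```

<p>A wrapper like this can then be called from evaluation or research scripts like any other API.</p>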



<h4>Other Tooling for Students</h4>



<p>Other tooling that is much more straightforward includes infrastructure for Slurm and a common infrastructure to analyze experiments. I believe Weights and Biases actually harms research by biasing the interpretation of results and how experiments are run, and so the custom tools that I have in the pipeline will help my students to avoid this bias.</p>



<p>A tool I have not developed yet, but which my colleagues have mentioned, is a review system where students can get feedback on ideas or papers by querying an agent or an agentic pipeline that mimics how I, their academic advisor, would give feedback. Imagine a student being able to get a first-pass critique of their paper at any time they like without being embarrassed about it or worrying about perceptions. This would not replace advising, but it would make our collaborations more productive by handling the basic structural and clarity feedback automatically.</p>



<p>While not all of these tools might be useful, and some are more like distractions, it is clear that with the right pipelines, workflow, and tools, productivity for students can be increased dramatically — and this can be driven by an advisor who invests in building these systems.</p>



<p>Similarly, a technical manager can develop tools and guide a team in this way. Even if agents cannot do all the work, you need to figure out what work you actually want to do and how you want to build on each other&#8217;s work as a team. Agents can work independently, but it might not be useful if your team is pulling on different ends of a problem. If coordination is missing, and everyone is using agents in their own way, it can lead to disaster. The tools an advisor or manager builds can provide that coordination layer.</p>



<p>All of these examples highlight where tools fail, where tools can be useful, and where tools might not be useful in the short term but give you the skills to improve tools in the future.</p>



<h2>Writing Tasks</h2>



<h3>Blog Posts</h3>



<p>You might have guessed it already: this blog post is AI-generated. My previous post about <a href="https://timdettmers.com/2025/12/10/why-agi-will-not-happen/">Why AGI Will Never Happen</a> was AI-generated too. More than 95% of the text of both blog posts comes from an AI model. I did not even type prompts. Most of it was just me rambling into a microphone while doing other things, then transcribing that voice into text, shaping it into a blog post, doing a style transfer to shape it into my voice, and then adding some small snippets that have character.</p>



<p>The editing and the added small snippets, this last 5%, are a cherry on top that is very important. But the key point stands: 95% is AI-generated, and I bet you still find this useful and enjoy the read. It has my style and my voice of writing and presenting information. Processing information in this way, to really make writing personal, is not that difficult if you use AI agents well.</p>



<p>While I am still experimenting with blog posts, this pipeline allows me to write blog posts much more quickly — and blog posts that are much more current. A blog post like this would have taken me days in the past. Now it takes about 3 hours: one hour to speak content into a microphone, 10 minutes for my agentic workflow, and then ~2 hours of reviewing and editing. It is very fast, and when using the agents, you notice that the quality is pretty good.</p>



<p>Would you agree that this blog post has soul? Or is it AI slop now that you know it is AI-generated?</p>



<h3>Writing Tasks: Grant Proposals</h3>



<p>Grant proposals are a major time sink as an academic. A CMU student costs $150,000, and I need to find that money by writing grant proposals. A lot of proposals are rejected, so you have to write lots of them.</p>



<p>It is interesting because while you might think the blog post approach should work, it actually does not work that well. Grant proposals need to have a particular structure, and even small deviations read poorly. Good design is familiar design, and good proposals are familiar proposals.</p>



<p>This is just like a good abstract — for example, an abstract in Nature almost always has the same structure, sentence by sentence, the same for every paper. That makes abstracts easy to read because you know where to find information.</p>



<p>I am dyslexic, and reading is very slow for me, but I learned to read papers at a relatively okay pace because I understand that they have a common structure that repeats again, and again, and again. I can skip sections, skip to particular phrases, and I know where an interesting part begins. If the introduction says &#8220;In this paper&#8221; or &#8220;Here,&#8221; then I know now the list of contributions starts.</p>



<p>Grant proposals are highly structured. A free-flowing, talkative approach that I use for blog posts does not work out of the box, but it can be made to work by introducing an abstraction pattern.</p>



<p>This abstraction pattern works as follows: you create sentence-by-sentence extractions of what the grant proposal content should be. For example, for an abstract:</p>



<ul><li>The first sentence is about the general field and the subfield</li><li>The second sentence mentions the problem, why it is important, and why it has not been solved</li><li>The third sentence states your contributions and your main results</li><li>The fourth sentence explains your main method</li><li>Then, depending on taste, you expand on this method or keep it brief</li><li>Finally, you state the impact and broad implications</li></ul>



<p>If you have an AI model, you can apply this process very easily. Just take a couple of grant proposals of your own, or others that you really liked. Then use an agent to extract this structure, sentence by sentence, and merge the multiple abstracted structures by commonality.</p>



<p>Then I use this structure together with an agent to create an interactive flow: the agent asks me specific questions, and I respond with a voice message about the content I want — often casual “rambling” about the research I want to do. After each response, the agent stores the content and checks the abstract template for missing key information. The agent asks follow-up questions, and I answer them with my voice tool.</p>
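<p>The interactive flow can be sketched as a simple slot-filling loop (the slot names mirror the abstract template above; the function and names are illustrations I wrote for this post, not an actual tool):</p>

```python
# Abstract template, one slot per sentence role.
TEMPLATE = {
    "field": "What is the general field and subfield?",
    "problem": "What is the problem, why is it important, and why is it unsolved?",
    "contribution": "What are your contributions and main results?",
    "method": "What is your main method?",
    "impact": "What is the impact and what are the broad implications?",
}

def next_question(answers):
    """Return (slot, question) for the first unfilled slot, or None when done."""
    for slot, question in TEMPLATE.items():
        if not answers.get(slot, "").strip():
            return slot, question
    return None

# Driver idea: the agent asks next_question(answers), records the
# voice-transcribed reply into answers[slot], and repeats until None,
# then hands the filled slots to the drafting and style-transfer step.
```

<p>The point of the loop is only to guarantee that no template slot is left empty before drafting begins.</p>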



<p>I then have the agent generate the draft and then smooth it over by doing style transfer using particular proposals that I have written and like.</p>



<p>With this, I can create a four-page grant proposal in about an hour and a half — even faster than a blog post.</p>



<h3>Meta Reviews</h3>



<p>Machine learning conferences are notorious for bad reviewing. The reviewing system is broken. There have been studies on ICLR and NeurIPS with clear results: reviewing does not work. Reviewing can identify the very worst papers and the best papers, but in between, it is a coin flip.</p>



<p>The finding from these studies is that reviewing quality is not related to knowledge but to effort. Undergrad students write much higher-quality reviews than PhD students or professors because they have more time and take it more seriously. For PhD students and professors, it is a chore.</p>



<p>Looking at that reality, using agents becomes very straightforward, and I would argue an imperative to improve review quality by reducing the time needed for reviewing.</p>



<p>In this case, we look at meta reviewing, reviewing the reviews, which is the task of an area chair. There are two philosophies about being an area chair. One is that you bring your own opinions and overrule decisions. The other is to follow what the reviewers said. I believe the second is more intellectually honest. While I have expertise and will sometimes overrule reviewers, I have not had the depth to read every paper thoroughly, and certain concerns might be valid. A good paper is not a paper that I like, but a paper that is useful for the research community, and usefulness is difficult to judge if you do not read a paper in depth.</p>



<p>What I built to help with meta reviewing is a system that analyzes the discussions and the points where reviewers disagree, gives summaries of papers, summarizes which papers are borderline, and identifies which are clear rejects or accepts. The clear accepts or rejects have low score variability, so their reviews can be processed quickly — you can understand why people have certain views.</p>



<p>The workflow is as follows: An agent uses my OpenReview login details to log in, navigate to the papers, get all the reviews and rebuttals, and store them to disk. Then the interactive part with the agent starts that helps me to understand where the issues are.</p>



<p>What is more subtle is tracking changes in the discussion. With the rebuttal, even if the score is not increased (which is very common because people do not have time), the rebuttal might contain information that could change the outcome.</p>



<p>From all this discussion about borderline cases, you can easily draft the first meta review. If it looks strange, you can ask the agent to explain, provide more detail, or provide evidence. It is a very interactive way of reviewing and actually mirrors what I would do without AI agents: separate straightforward and difficult cases; analyze difficult cases for disagreement; figure out which arguments have merit and if author rebuttals change the picture; draft a review; edit by looking at details; submit.</p>



<p>All these things can be done by an agent, and they can be done faster and probably more precisely. Understanding a subtle argument of a paper I have not seen before, between reviewers with different perspectives — this is hard if it is 5 PM and I have already had eight meetings, and I am just tired. But if I do it with my voice tool and my meta review agent system, this allows me to write high-quality meta reviews and make decisions that consider all information and arguments of reviewers and authors carefully.</p>



<p>The use of agents for meta reviews might be highly controversial, but again, AI-generated content is highly personal if you do it right. This also goes for reviews. I think we do a disservice to the research community if we do not use agents for reviewing, since they can improve quality dramatically.</p>



<h2>Where Agents Fail: A Study of Email Automation</h2>



<p>I alluded earlier to the fact that I tried to automate emails for a long time. Over two months, I worked quite a bit on automating emails — for one, because I do not like emails, and also because they are now a major part of my work.</p>



<p>I wanted to build a system that helps me manage, prioritize, and draft emails. For most people, the process of “doing your emails” is probably similar: categorize emails by urgency, map out the information needed for a reply, and prioritize replies given the time you have until your next meeting or other event.</p>



<p>Doing this manually is very simple and fast. I can often look at the title and immediately sort it into a category. I can skim an email for 10 seconds and know if I need to reply now or if it has time. I can organize emails in bulk to review later.</p>



<p>My initial attempt at email automation was very focused on features. Can I do this categorization, prioritization, bulk sorting, and get the gist of an email with agents?</p>



<p>But here is the issue: even if you automate all of this, you still have a similar workflow. If you categorize an email automatically, you still need to look at the categorization to see if there are new emails. If you have an AI summary of an email, you still need to read it. If you create agent-generated drafts, you need to look at each draft and check that it has the right details and the right tone, and that it actually says what you wanted to say.</p>



<p>Furthermore, Gmail is a familiar interface. You know where everything is, and all these tasks — prioritization, categorization, etcetera — can be done easily and quickly. If an AI does that, many things are automated, but you still need to use a user interface. This interface may be unfamiliar, not optimized for all workflows, or might miss crucial information or features. And navigating and using an agent-driven email system costs time, just like doing it manually costs time.</p>



<p>Here, the process optimization view kicks in. If I can categorize an email within five seconds, that is pretty fast. An AI agent needs to beat those five seconds and be more precise than I am for it to actually be useful. While the reading and categorization can happen in the background, with an AI-generated draft, I still need to navigate to that draft and read it. That might take 10 to 30 seconds just for navigation and reading, plus an additional minute for editing the draft. In many cases, the manual approach is about equally fast. But if you add the development time for this system (it was more than 100 hours), using the agentic system becomes clearly net negative in terms of productivity.</p>
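<p>A quick back-of-the-envelope calculation makes this concrete. The numbers below are illustrative assumptions based on the rough timings above, not measurements:</p>

```python
SECONDS_PER_HOUR = 3600

# Assumed, optimistic numbers: manual handling ~95 s per email,
# agent-assisted handling ~80 s (navigate to draft, read, edit).
dev_time_s = 100 * SECONDS_PER_HOUR   # >100 hours of development time
manual_handling_s = 5 + 90            # categorize + write a reply by hand
agent_handling_s = 20 + 60            # find/read the draft + edit it
saving_per_email_s = manual_handling_s - agent_handling_s

emails_to_break_even = dev_time_s / saving_per_email_s
print(emails_to_break_even)  # 24000.0 emails before development time is repaid
```

<p>Even assuming a saving on every single email — which, as noted, often does not materialize — the development time dominates for a very long while.</p>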



<p>Despite all these edge cases, I did not want to give up. For one, I really do not like emails. But the second part is that, for me, it was a challenge: can I automate this task? And if I cannot, it would serve as a hard-won lesson for future automation challenges.</p>



<p>So I made a second attempt. I knew about the process. I knew about the importance of interfaces and how to structure information. Since I am an avid Vim user, I wanted to build a vim-optimized interface. This was a long process — co-designing functionality, agents, and the user interface. My productivity using the agentic email system improved day by day, but at some point I saw the improvement plateauing, and I asked: Is Gmail, if I use it the right way, faster?&nbsp;</p>



<p>So I compared time spent on emails between the tool I created and just using Gmail – which is very much the process optimization view of having a stopwatch on the factory floor. What I found is that just using Gmail is faster. I could not get any degree of automation improvement by using agents for emails.</p>



<p>That was a very important lesson. Sometimes you fail, and that failure teaches you something valuable for the next challenge.</p>



<h2>Conclusion</h2>



<p>If you take away one thing from this blog post, let it be this: agent use is a skill, and like any skill, it requires deliberate practice, understanding of when it applies, and acceptance that you will fail often before you succeed.</p>



<p>The hype is real in some domains and misleading in others. Software engineering parallelization is real but not generalizable. The personal nature of AI-generated content is real and profound. The need for process thinking before automation is real and often ignored.</p>



<p>I hope these perspectives have been useful to help you think about how you can use agents, where agents work well, and what is hype and what is not. The key is to think carefully, experiment often, and build skills for the long term. I hope this blog post will help you make agents your own and see more and more benefits from agent use.</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2026/01/13/use-agents-or-be-left-behind/">Use Agents or Be Left Behind? A Personal Guide to Automating Your Own Work</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://timdettmers.com/2026/01/13/use-agents-or-be-left-behind/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1238</post-id>	</item>
		<item>
		<title>Why AGI Will Not Happen</title>
		<link>https://timdettmers.com/2025/12/10/why-agi-will-not-happen/</link>
					<comments>https://timdettmers.com/2025/12/10/why-agi-will-not-happen/#comments</comments>
		
		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Wed, 10 Dec 2025 15:05:30 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://timdettmers.com/?p=1233</guid>

					<description><![CDATA[<p>If you are reading this, you probably have strong opinions about AGI, superintelligence, and the future of AI. Maybe you believe we are on the cusp of a transformative breakthrough. Maybe you are skeptical. This blog post is for those who want to think more carefully about these claims and examine them from a perspective [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2025/12/10/why-agi-will-not-happen/">Why AGI Will Not Happen</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>If you are reading this, you probably have strong opinions about AGI, superintelligence, and the future of AI. Maybe you believe we are on the cusp of a transformative breakthrough. Maybe you are skeptical. This blog post is for those who want to think more carefully about these claims and examine them from a perspective that is often missing in the current discourse: the physical reality of computation.</p>



<span id="more-1233"></span>



<p>I have been thinking about this topic for a while now, and what prompted me to finally write this down was a combination of things: a Twitter thread, conversations with friends, and a growing awareness that the thinking around AGI and superintelligence is not just optimistic, but fundamentally flawed. The purpose of this blog post is to address what I see as very sloppy thinking, thinking that is created in an echo chamber, particularly in the Bay Area, where the same ideas amplify themselves without critical awareness. This amplification of bad ideas and thinking, exuded by the rationalist and EA movements, is a big problem in shaping a beneficial future for everyone. Realistic thought can be used to ground where we are and where we have to go to shape a future that is good for everyone.</p>



<p>I want to talk about hardware improvements, AGI, superintelligence, scaling laws, the AI bubble, and related topics. But before we dive into these specific areas, I need to establish a foundation that is often overlooked in these discussions. Let me start with the most fundamental principle.</p>





<h2>Computation is Physical</h2>



<p>A key problem with ideas, particularly those coming from the Bay Area, is that they often live entirely in the idea space. Most people who think about AGI, superintelligence, scaling laws, and hardware improvements treat these concepts as abstract ideas that can be discussed like philosophical thought experiments. In fact, a lot of the thinking about superintelligence and AGI comes from Oxford-style philosophy. Oxford, the birthplace of effective altruism, mixed with the rationality culture from the Bay Area, gave rise to a strong distortion of how to clearly think about certain ideas. All of this sits on one fundamental misunderstanding of AI and scaling: computation is physical.</p>



<p>For effective computation, you need to balance two things. You need to move global information to a local neighborhood, and you need to pool multiple pieces of local information to transform old information into new. While the complexity of local computation is virtually constant — much accelerated by smaller transistors — movement scales quadratically with distance to local computation units. While memory movement also benefits from smaller transistors, improvements become quickly sublinear due to the squared nature of memory access patterns.</p>



<p>This is most easily seen by looking at cache hierarchies. L1, L2 and L3 cache are physically the same technology, but computationally they are very different. L2 and L3 are much larger than L1, but they are also much slower. This is because L2 and L3 are further away, physically, from the computational core, and memory lookups need to traverse a longer distance due to the physical size.&nbsp;</p>
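<p>A rough calculation shows how hard this physical limit is. Even at the speed of light — an upper bound no real wire comes close to — a signal covers only about 10 cm per cycle of a 3 GHz clock:</p>

```python
c = 3.0e8          # speed of light in m/s, an absolute upper bound
clock_hz = 3.0e9   # a typical ~3 GHz clock

cm_per_cycle = c / clock_hz * 100
print(cm_per_cycle)  # 10.0 cm per clock cycle, at best

# Real on-chip wires are RC-limited and propagate signals far slower
# than c, so a cache even a few millimeters away costs multiple cycles.
```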



<p>Two ideas to remember: First, larger caches are slower. Second, as we get smaller and smaller transistors, computation gets cheaper, but memory becomes more expensive, relatively speaking. The fraction of silicon area dedicated to memory on a chip has increased over time to the point where now computational elements on a chip are trivial in proportion. Almost all area is allocated to memory. In other words, if you want to produce 10 exaflops on a chip, you can do that easily &#8212; but you will not be able to service it with memory, making those FLOPS useless (the NVIDIA marketing department is good at ignoring this fact). All of this makes AI architectures like the transformer fundamentally physical. Our architectures are not abstract ideas that can be developed and thrown around carelessly. They are physical optimizations of information processing units.</p>



<p>To process information usefully, you need to do two things: compute local associations (MLP) and pool more distant associations to the local neighborhood (attention). This is because local information alone only helps you to distinguish closely related information, while pooling distant information helps you to form more complex associations that contrast or augment local details. The transformer is one of the most physically efficient architectures because it combines the simplest ways of doing this local computation and global pooling of information. The global pooling of information might be made more effective through research, and there is still active investigation going on that I think might be promising, but it has diminishing returns &#8212; the transformer architecture is close to physically optimal.</p>
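<p>The division of labor described above can be shown in a minimal numpy sketch of a transformer block — heads, layer norms, and biases are omitted, so this is an illustration of the principle, not a production implementation:</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # Global pooling: every position gathers information from all others.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v

def mlp(x, W1, W2):
    # Local computation: applied to each position independently.
    return np.maximum(x @ W1, 0) @ W2

rng = np.random.default_rng(0)
seq, d, d_ff = 8, 16, 64
x = rng.normal(size=(seq, d))
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
W1 = 0.1 * rng.normal(size=(d, d_ff))
W2 = 0.1 * rng.normal(size=(d_ff, d))

h = x + attention(x, Wq, Wk, Wv)  # pool distant associations locally
y = h + mlp(h, W1, W2)            # transform local information
print(y.shape)  # (8, 16)
```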



<p>Computation is physical. This is also true for biological systems. The computational capacity of all animals is limited by the possible caloric intake in their ecological niche. If you have the average calorie intake of a primate, you can calculate within 99% accuracy how many neurons that primate has. Humans invented cooking, which increased the physically possible caloric intake substantially through predigestion. But we reached the physical limits of intelligence. When women are pregnant, they need to feed two brains, which is so expensive that physically, the gut cannot mobilize enough macronutrients to keep both alive if our brains were bigger. With bigger brains, we would not be able to have children &#8212; not because of the birth canal being too small, but because we would not be able to provide enough energy &#8212; making our current intelligence a physical boundary that we cannot cross due to energy limitations.</p>



<p>We are close to reaching the same limits for digital computation.</p>



<h2>Linear Progress Needs Exponential Resources</h2>



<p>There have been studies about progress in all kinds of fields that come to the same conclusion: linear progress needs exponential resources. What does that mean? If you want to improve a system further and further, make it more precise, or improve its efficiency, you need exponentially more resources with any improvement that you make. This is true for all kinds of fields and problems being investigated, and it is pretty clear why.</p>



<p>There are two realities at play here: one physical and one in the idea space. In the physical reality, if you need to accumulate resources in time and space to produce an outcome, then for logistical reasons, the overall effect that is locally produced needs linear resources to produce a linear outcome. But because of physicality and because matter takes up space, those resources can only be pooled at an increasingly slowing rate due to contention in space or time.</p>



<p>In the idea space, there is a similar phenomenon, which is less obvious. If two ideas are completely independent, they can have an effect that is ten times larger than any single idea. But if ideas are related, then the overall impact is limited due to diminishing returns &#8212; the ideas are just too correlated. If an idea builds on another, it can only be so much better. Often, if there is a dependency between ideas, one is a refinement of the other. Refinements, even if they are extremely creative, will yield incremental improvements. If a field is large enough, even if one tries to work on very different ideas, they are still heavily related to previous ideas. For example, while state-space models and Transformers seem like very different approaches to attention, they concentrate on the same problem. Very minimal gains can be achieved through any idea that modifies attention in these ways.</p>



<p>These relationships are most striking in physics. There was a time when progress could be made by individuals – not so much anymore.</p>



<p>I talked to a top theoretical physicist at a top research university, and he told me that all theoretical work in physics is, in some sense, either incremental refinement or made-up problems. The core problem of the idea space is this: if the idea is in the same sub-area, no meaningful innovation is possible because most things have already been thought. A first urge is to look for wildly creative ideas, but the problem is that they are still bound by the rules of that subspace, rules that often exist for a very good reason (see the graduate-student theory-of-everything phenomenon). So the theoretical physicist faces only two meaningful choices: refine other ideas incrementally, which leads to insignificant impact; or work on rule-breaking unconventional ideas that are interesting but which will have no clear impact on physical theory.</p>



<p>Experimental physics demonstrates the physical limitations. The experiments that test more and more fundamental laws of physics and constituent particles — in other words, the standard model — become increasingly expensive. The standard model is incomplete, and we do not know how to fix it. Higher energies at the Large Hadron Collider have only led to more inconclusive results and the ruling out of more theories. We have no understanding of what dark energy or dark matter is, even though we build increasingly complex experiments that cost billions of dollars. The reality might be that certain aspects of physics are unknowable, hidden by complexity that cannot be penetrated with the resources that we can muster.</p>



<p>If you want to get linear improvements, you need exponential resources.</p>



<h2>GPUs No Longer Improve</h2>



<p>One of the most common misconceptions I see is that people assume hardware keeps improving and improving. This is an important misconception that explains a lot of the poor thinking around AI progress. The efficiency of GPUs has driven almost all innovation in AI. AlexNet was only possible by developing one of the first CUDA implementations that could compute convolutions over networked GPUs. Further innovation was mostly possible through improved GPUs and using more GPUs. Almost everybody sees this pattern — GPUs improve, AI performance improves — and it is easy to think that GPUs will improve further and will continue to improve AI outcomes. Every generation of GPUs has been better, and it would seem foolish to think that it will stop. But actually, it is foolish to think that GPUs will continue to improve. In fact, GPUs will no longer improve meaningfully. We have essentially seen the last generation of significant GPU improvements. GPUs maxed out in performance per cost around 2018 &#8212; after that, we added one-off features that exhaust quickly.</p>



<p>The first of these one-off features was 16-bit precision, then Tensor Cores or the equivalent, then high-bandwidth memory (HBM), then the TMA or equivalent, then 8-bit precision, then 4-bit precision. And now we are at the end, both in the physical and the idea space. I have shown in my paper about k-bit inference scaling laws what data types with particular block sizes and computational arrangements are optimal. This has already been adopted by hardware manufacturers. Any further improvement will lead not to straightforward improvements but to trade-offs: either better memory footprint at lower computational efficiency or higher computational throughput at higher memory footprint. Even if you can innovate &#8212; linear improvements need exponential resources &#8212; further improvements will be trivial and will not add any meaningful advancement.</p>
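<p>To make the data-type point concrete, here is a minimal sketch of block-wise absmax quantization — the basic scheme behind the block sizes mentioned above. It is a simplified illustration of the idea, not the actual bitsandbytes implementation:</p>

```python
import numpy as np

def quantize_blockwise(x, block_size=64, bits=8):
    # Each small block gets its own scale, so a single outlier only
    # degrades precision within its block, not across the whole tensor.
    levels = 2 ** (bits - 1) - 1
    blocks = x.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / levels
    q = np.round(blocks / scale).astype(np.int8)
    return q, scale

def dequantize_blockwise(q, scale, shape):
    return (q * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 64)).astype(np.float32)
q, scale = quantize_blockwise(w.ravel())
w_hat = dequantize_blockwise(q, scale, w.shape)
print(np.abs(w - w_hat).mean() < 0.01)  # small mean error at 8 bits
```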



<p>While GPUs can no longer improve meaningfully, rack-level optimizations are still critically important. Efficient shuttling of key-value caches is one of the most important problems in AI infrastructure. The current solution to this problem, however, is also relatively straightforward. Companies like OpenAI boast about their AI infrastructure, but it is relatively simple to design because there is essentially only one optimal way to design it. And while it is complex to implement, it just needs clear thinking and mostly hard, time-intensive engineering. But the overall system design is not particularly novel. OpenAI – or other frontier labs – have no fundamental advantage in their inference and infrastructure stacks. The only way to gain an advantage is by having slightly better rack-level hardware optimizations or data-center-level hardware optimizations. But these will also run out quickly – maybe 2026, maybe 2027.</p>



<h2>Why Scaling Is Not Enough</h2>



<p>In my Twitter thread, I talked about how Gemini might signal a plateau in AI progress in the sense that we might not see meaningful improvements anymore. A lot of people responded with something along the lines of, &#8220;You are being too pessimistic. Can you not see that scaling works?&#8221; The point here is a bit more subtle, so I want to elaborate.&nbsp;</p>



<p>I believe in scaling laws and I believe scaling will improve performance, and models like Gemini are clearly good models. The problem with scaling is this: for linear improvements, we previously had exponential growth in GPU performance, which canceled out the exponential resource requirements of scaling. This is no longer true. In other words, previously we invested roughly linear costs to get linear payoff, but now it has turned to exponential costs. That would not be a problem on its own, but it sets a clear physical limit on scaling that is rapidly approaching. We have maybe one, maybe two more years of scaling left because further improvements become physically infeasible. The scaling improvements in 2025 were not impressive. Scaling in 2026 and 2027 had better work out.</p>
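<p>The arithmetic behind these exponential costs follows directly from power-law scaling laws. With illustrative constants (not fitted values), each equal step down in loss multiplies the required compute instead of adding to it:</p>

```python
# Power-law scaling law: loss L(C) = a * C**(-b).
# a and b are illustrative constants, not fitted to any real model.
a, b = 10.0, 0.05

def compute_for_loss(L):
    # Invert L = a * C**(-b)  ->  C = (a / L) ** (1 / b)
    return (a / L) ** (1 / b)

step1 = compute_for_loss(2.5) / compute_for_loss(3.0)
step2 = compute_for_loss(2.0) / compute_for_loss(2.5)
print(step1, step2)  # each 0.5 loss improvement costs a growing multiple of compute
```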



<p>Despite these exponential costs, the current infrastructure build-out is reasonable, particularly with the growth of inference use, but it still creates a very precarious balance. The biggest problem is this: if scaling does not provide much larger improvements than research/software innovations, then hardware becomes a liability and not an asset.&nbsp;</p>



<p>Small players like MoonshotAI and Z.ai show that they do not need many resources to reach frontier performance (I personally prefer Kimi K2-thinking over Sonnet 4.5 for coding). If these companies innovate beyond scale, they might just create the best model. While they might still use existing infrastructure, they could just switch to Huawei Ascend chips for inference, which are more than fine for providing good inference performance.</p>



<p>Another big threat to scaled-up infrastructure is that, currently, large-model inference efficiency is strongly tied to a large user base due to network scaling. The problem is that an efficient deployment of a large model needs a certain number of GPUs to overlap computation with networking and KV-cache length partitioning. Such deployments are ultra-efficient but demand a large user base to unlock full utilization and, with that, cost-effectiveness. That is why open-weight models have not had the expected impact: the infrastructure cost of large deployments requires a large user base. However, this problem can be solved with software.</p>
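<p>Some rough KV-cache arithmetic shows why large deployments demand scale. All figures below are assumptions for illustration, not the specs of any particular model or lab:</p>

```python
# Hypothetical large model: 60 layers, 8 KV heads, head dim 128, 16-bit.
layers, kv_heads, head_dim, bytes_per_elem = 60, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V

ctx = 32_768
kv_gb_per_request = kv_bytes_per_token * ctx / 2**30
print(kv_gb_per_request)  # 7.5 GiB of KV cache for one 32k-token request

params = 300e9  # a ~300-billion-parameter model
weight_gb = params * bytes_per_elem / 2**30
print(round(weight_gb))  # GiB for the weights alone at 16 bit
```

<p>The weights alone span many GPUs, and every concurrent long-context request adds several gigabytes of KV cache — so the deployment is only cost-effective when enough users keep those GPUs saturated.</p>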



<p>While vLLM and SGLang currently try to optimize frontier-type deployments, they do not provide this efficiency at smaller scales. With the right inference stack beyond vLLM/SGLang, people would be able to deploy a ~300-billion-parameter model with the same efficiency as OpenAI or Anthropic deploys their frontier models. If smaller models become more capable &#8212; we see this with GLM 4.6 &#8212; or if AI applications become more specialized, the infrastructure advantage of frontier labs might vanish overnight. The software complexity evaporates, and open-source, open-weight deployments might be close to physically optimal, both in terms of computational efficiency and information processing efficiency. This is a large risk for frontier players.</p>



<p>Under slowing scaling, any of these three factors might degrade the value of AI infrastructure significantly and rapidly: (1) research/software innovations, (2) strong open-weight inference stacks, (3) shift to other hardware.</p>



<p>The current trends do not look good for frontier labs.&nbsp;</p>



<h2>Frontier AI Versus Economic Diffusion</h2>



<p>The US and China follow two different approaches to AI. The US follows the idea that there will be one winner who takes it all &#8211; the one that builds superintelligence wins. Even falling short of superintelligence or AGI, if you have the best model, almost all people will use your model and not the competition&#8217;s model. The idea is: develop the biggest, baddest model and people will come.</p>



<p>China&#8217;s philosophy is different. They believe model capabilities do not matter as much as application. What matters is how you use AI. The key indicator of progress is how much AI is integrated into everything and how useful it is. If one model is better than another, it does not automatically mean it will be used more widely. What is important is that the model is useful and yields productivity gains at a reasonable cost. If the current approach is more productive than the previous one, it will be adopted. But hyper-optimization for slightly better quality is not very effective. In most cases, settling on &#8220;good enough&#8221; yields the highest productivity gain.</p>



<p>I think it is easy to see that the US philosophy is short-sighted and very problematic &#8212; particularly if model capability slows. The Chinese philosophy is more long-term focused and pragmatic.&nbsp;</p>



<p>The key value of AI is that it is useful and increases productivity. That makes it beneficial. It is clear that, similarly to computers or the internet, AI will be used everywhere. The problem is that if AI were just used for coding and engineering, it would have a very limited impact. While a lot of economic activity is supported by digital programs, these also have diminishing returns, and producing more software will not improve outcomes significantly if existing software is already good enough (just look at the SaaS failure in China). This makes widespread economic integration absolutely vital for AI effectiveness.</p>



<p>So in order to provide real value, AI needs to be used in ways that provide new benefits, not just improvements to what already exists. This is a difficult problem, but the right answer is to integrate AI into everything to squeeze out non-linear improvements, see what works and what does not, then keep what is working. China is taking this approach by subsidizing applications that use AI to encourage adoption. The Chinese population is very receptive to innovation, which facilitates this process. It is nothing unusual in China to see an 80-year-old grandma use AI to help her with her daily life. The US, on the other hand, bets on ideas like AGI and superintelligence, which I believe are fundamentally flawed concepts that have little relevance to future AI progress. This becomes clear when you think carefully about what these terms actually mean in physical reality.</p>



<h2>AGI Will Never Happen, and Superintelligence Is a Fantasy</h2>



<p>There is this pattern I have noticed: when you ask people in the Bay Area when AGI will happen, they always say it is a few years in the future, and it will have a massive impact. Then, if you ask them what AGI actually is, they do not include any physical tasks in their definition, and they do not consider resource inputs.&nbsp;</p>



<p>True AGI, one that can do all things humans can, would need to be able to do physical tasks &#8211; which comprise the largest economic sector. In short, AGI should include physical robots or machines that are able to do economically meaningful work in the physical world. While physical robots might be convenient for unloading your dishwasher, you will not see them replacing specialized systems in factories. Specialized robots in factories are too efficient, too precise. China demonstrates that dark factories &#8212; fully automated facilities &#8212; are already possible. Most robotics problems are solved problems in controlled environments. Most existing robotics problems that remain unsolved are also economically unviable. Stitching sleeves to a t-shirt is an unsolved robotics problem, but it is also not particularly economically meaningful in most contexts. Household robots will be interesting, but if it takes me two minutes to unload my dishwasher, I am not sure I need a robot for that. And while in a couple of years a robot might be able to fold laundry, I would rather spend a few minutes folding it myself with no creases than have a robot do a mediocre job.</p>



<p>The main problem with robotics is that learning follows scaling laws that are very similar to the scaling laws of language models. The problem is that data in the physical world is just too expensive to collect, and the physical world is too complex in its details. Robotics will have limited impacts. Factories are already automated and other tasks are not economically meaningful.</p>



<p>The concept of superintelligence is built on a flawed premise. The idea is that once you have an intelligence that is as good or better than humans — in other words, AGI — then that intelligence can improve itself, leading to a runaway effect. This idea comes from Oxford-based philosophers who brought these concepts to the Bay Area. It is a deeply flawed idea that is harmful for the field. The main flaw is that this idea treats intelligence as purely abstract and not grounded in physical reality. To improve any system, you need resources. And even if a superintelligence uses these resources more effectively than humans to improve itself, it is still bound by the scaling of improvements I mentioned before — linear improvements need exponential resources. Diminishing returns can be avoided by switching to more independent problems – like adding one-off features to GPUs – but these quickly hit their own diminishing returns. So, superintelligence can be thought of as filling gaps in capability, not extending the frontier. Filling gaps can be useful, but it does not lead to runaway effects &#8212; it leads to incremental improvements.</p>
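<p>A toy model makes the "filling gaps" point explicit. If each round of self-improvement yields only a fraction of the previous round's gain — diminishing returns — total capability converges instead of running away. This is an illustration of the argument, not a model of any real system:</p>

```python
def total_capability(first_gain=1.0, r=0.6, rounds=1000):
    # Each round of self-improvement yields r times the previous gain.
    cap, gain = 0.0, first_gain
    for _ in range(rounds):
        cap += gain
        gain *= r
    return cap

# Geometric series: converges to first_gain / (1 - r) = 2.5; no runaway.
print(total_capability())
```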



<p>Furthermore, the same people who think that GPUs will infinitely improve are often the people who think superintelligence will make those improvements faster and better. But they do not realize that GPUs can no longer be meaningfully improved. We can wait for better HBM memory technology for speed, and for chiplets and advanced packaging to improve yield/cost, but that is it. Rack-level optimization will likely hit the physical wall in 2026 or 2027. A superintelligence will not accelerate the progress made in HBM development, manufacturing, testing, and integration. The transformer architecture is close to physically optimal. Superintelligence will not be able to meaningfully improve neural network architectures. Efficient large-scale deployments for inference are largely a solved engineering problem. It just needs some careful engineering and time, but very little creativity is required to solve this problem close to physical optimality. Superintelligence will not be able to improve our inference stack by much.</p>



<p>A superintelligence might help with economic diffusion of AI technology, but in the end, the limiting factor of economic diffusion is implementation and adoption, not capability. It is clear to me that any organization that strives primarily for superintelligence as a goal will encounter significant challenges and will ultimately falter and be displaced by players that provide general economic diffusion.&nbsp;</p>



<p>In summary, AGI, as commonly conceived, will not happen because it ignores the physical constraints of computation, the exponential costs of linear progress, and the fundamental limits we are already encountering. Superintelligence is a fantasy because it assumes that intelligence can recursively self-improve without bound, ignoring the physical and economic realities that constrain all systems. These ideas persist not because they are well-founded, but because they serve as compelling narratives in an echo chamber that rewards belief over rigor.&nbsp;</p>



<p>The future of AI will be shaped by economic diffusion, practical applications, and incremental improvements within physical constraints — not by mythical superintelligence or the sudden emergence of AGI. The sooner we accept this reality, the better we can focus on building AI systems that actually improve human productivity and well-being.</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2025/12/10/why-agi-will-not-happen/">Why AGI Will Not Happen</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://timdettmers.com/2025/12/10/why-agi-will-not-happen/feed/</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1233</post-id>	</item>
		<item>
		<title>Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning</title>
		<link>https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/</link>
					<comments>https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/#comments</comments>
		
		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Mon, 30 Jan 2023 15:50:00 +0000</pubDate>
				<category><![CDATA[Deep Learning]]></category>
		<category><![CDATA[Hardware]]></category>
		<category><![CDATA[AMD]]></category>
		<category><![CDATA[CPU]]></category>
		<category><![CDATA[High Performance Computing]]></category>
		<category><![CDATA[Matrix Multiplication]]></category>
		<category><![CDATA[Parallel Computing]]></category>
		<category><![CDATA[PCIe Lanes]]></category>
		<category><![CDATA[Sparse Training]]></category>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=4</guid>

					<description><![CDATA[<p>Making the right choice when it comes to buying a GPU is critical. So how do you select the GPU which is right for you? This blog post will delve into that question and will lend you advice which will help you to make a choice that is right for you.</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/">Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-container-1 wp-block-group eplus-nHlelL"><div class="wp-block-group__inner-container">
<p class="eplus-huy78r">Deep learning is a field with intense computational requirements, and your choice of GPU will fundamentally determine your deep learning experience. But what features are important if you want to buy a new GPU? GPU RAM, cores, tensor cores, caches? How to make a cost-efficient choice? This blog post will delve into these questions, tackle common misconceptions, give you an intuitive understanding of how to think about GPUs, and will lend you advice, which will help you to make a choice that is right for you.</p>



<span id="more-6"></span>



<p class="eplus-sxlaEI">This blog post is designed to give you different levels of understanding of GPUs and the new Ada series GPUs from NVIDIA. You have the choice: (1) If you are not interested in the details of how GPUs work, what makes a GPU fast compared to a CPU, and what is unique about the new NVIDIA RTX 40 Ada series, you can skip right to the performance and performance per dollar charts and the recommendation section. The cost/performance numbers form the core of the blog post, and the content surrounding them explains the details of what makes up GPU performance.</p>



<p class="eplus-1g9jAa">(2) If you worry about specific questions, I have answered and addressed the most common questions and misconceptions in the later part of the blog post.</p>



<p class="eplus-ErgVnq">(3) If you want to get an in-depth understanding of how GPUs, caches, and Tensor Cores work, the best is to read the blog post from start to finish. You might want to skip a section or two based on your understanding of the presented topics.</p>





<h2 class="eplus-dYbLao"><strong>Overview</strong></h2>



<p class="eplus-xVYlmp">This blog post is structured in the following way. First, I will explain what makes a GPU fast. I will discuss CPUs vs GPUs, Tensor Cores, memory bandwidth, and the memory hierarchy of GPUs and how these relate to deep learning performance. These explanations might help you get a more intuitive sense of what to look for in a GPU. I discuss the unique features of the new NVIDIA RTX 40 Ada GPU series that are worth considering if you buy a GPU. From there, I make GPU recommendations for different scenarios. After that follows a Q&amp;A section of common questions posed to me in Twitter threads; in that section, I will also address common misconceptions and some miscellaneous issues, such as cloud vs desktop, cooling, AMD vs NVIDIA, and others.&nbsp;</p>



<h2 class="eplus-TzWEWa">How do GPUs work?</h2>



<p class="eplus-YDMSkp">If you use GPUs frequently, it is useful to understand how they work. This knowledge will help you to understand the cases where GPUs are fast or slow. In turn, you might be able to understand better why you need a GPU in the first place and how other future hardware options might be able to compete. You can skip this section if you just want the useful performance numbers and arguments to help you decide which GPU to buy. The best high-level explanation for the question of how GPUs work is my following Quora answer:</p>



<p>Read <a href="https://www.quora.com/Why-are-GPUs-well-suited-to-deep-learning/answer/Tim-Dettmers-1">Tim Dettmers&#8217; answer to &#8220;Why are GPUs well-suited to deep learning?&#8221; on Quora</a>.</p>



<p class="eplus-bfPm9F">This is a high-level explanation that explains quite well why GPUs are better than CPUs for deep learning. If we look at the details, we can understand what makes one GPU better than another.</p>



<h2 class="eplus-9AkYN1">The Most Important GPU Specs for Deep Learning Processing Speed</h2>



<p class="eplus-XflNML">This section can help you build a more intuitive understanding of how to think about deep learning performance. This understanding will help you to evaluate future GPUs by yourself. This section is sorted by the importance of each component. Tensor Cores are most important, followed by the memory bandwidth of a GPU, the cache hierarchy, and only then the FLOPS of a GPU.</p>



<h3 class="eplus-YyGRdI">Tensor Cores</h3>



<p class="eplus-JKvKVO">Tensor Cores are tiny cores that perform very efficient matrix multiplication. Since the most expensive part of any deep neural network is matrix multiplication, Tensor Cores are very useful. In fact, they are so powerful that I do not recommend any GPUs that do not have Tensor Cores.</p>



<p class="eplus-JKvKVO">It is helpful to understand how they work to appreciate the importance of these computational units specialized for matrix multiplication. Here I will show you, for a simple example of A*B=C matrix multiplication where all matrices have a size of 32&#215;32, what the computational pattern looks like with and without Tensor Cores. This is a simplified example, and not exactly how a high-performance matrix multiplication kernel would be written, but it has all the basics. A CUDA programmer would take this as a first “draft” and then optimize it step-by-step with concepts like double buffering, register optimization, occupancy optimization, instruction-level parallelism, and many others, which I will not discuss at this point. </p>



<p class="eplus-JQnNxT">To understand this example fully, you have to understand the concept of cycles. If a processor runs at 1 GHz, it can do 10^9 cycles per second. Each cycle represents an opportunity for computation. However, most of the time, operations take longer than one cycle. Thus we essentially have a queue where the next operation needs to wait for the previous operation to finish. This is also called the latency of the operation.</p>



<p class="eplus-Nh32jj">Here are some important latency cycle timings for operations. These times can change from GPU generation to GPU generation. <a href="https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s33322/">These numbers are for Ampere GPUs</a>, which have relatively slow caches.</p>



<ul class="eplus-yOHNRh"><li>Global memory access (up to 80GB): ~380 cycles</li><li>L2 cache: ~200 cycles</li><li>L1 cache or Shared memory access (up to 128 kb per Streaming Multiprocessor): ~34 cycles</li><li>Fused multiplication and addition, a*b+c (FFMA): 4 cycles</li><li>Tensor Core matrix multiply: 1 cycle</li></ul>



<p class="eplus-xmhx78">Each operation is always performed by a pack of 32 threads. This pack is termed a warp of threads. Warps usually operate in a synchronous pattern — threads within a warp have to wait for each other. All memory operations on the GPU are optimized for warps. For example, loading from global memory happens at a granularity of 32*4 bytes, exactly 32 floats, exactly one float for each thread in a warp. We can have up to 32 warps = 1024 threads in a streaming multiprocessor (SM), the GPU-equivalent of a CPU core. The resources of an SM are divided up among all active warps. This means that sometimes we want to run fewer warps to have more registers/shared memory/Tensor Core resources per warp.</p>



<p class="eplus-5GehcT">For both of the following examples, we assume we have the same computational resources. For this small example of a 32&#215;32 matrix multiply, we use 8 SMs (about 10% of an RTX 3090) and 8 warps per SM.</p>



<p class="eplus-G9pv5H">To understand how the cycle latencies play together with resources like threads per SM and shared memory per SM, we now look at examples of matrix multiplication. While the following example roughly follows the sequence of computational steps of matrix multiplication for both with and without Tensor Cores, please note that these are very simplified examples. Real cases of matrix multiplication involve much larger shared memory tiles and slightly different computational patterns.</p>



<h4 class="eplus-Lj4Hjc">Matrix multiplication without Tensor Cores</h4>



<p class="eplus-eBAn6w">If we want to do an A*B=C matrix multiply, where each matrix is of size 32&#215;32, then we want to load memory that we repeatedly access into shared memory because its latency is about six times lower (200 cycles vs 34 cycles). A memory block in shared memory is often referred to as a memory tile or just a tile. Loading two 32&#215;32 floats into a shared memory tile can happen in parallel by using 2*32 warps. We have 8 SMs with 8 warps each, so due to parallelization, we only need to do a single sequential load from global to shared memory, which takes 200 cycles.</p>



<p class="eplus-bSRgVT">To do the matrix multiplication, we now need to load a vector of 32 numbers from shared memory A and shared memory B and perform a fused multiply-and-accumulate (FFMA). Then store the outputs in registers C. We divide the work so that each SM does 8x dot products (32&#215;32) to compute 8 outputs of C. Why this is exactly 8 (4 in older algorithms) is very technical. I recommend Scott Gray’s blog post on <a href="https://github.com/NervanaSystems/maxas/wiki/SGEMM">matrix multiplication</a> to understand this. This means we have 8x shared memory accesses at the cost of 34 cycles each and 8 FFMA operations (32 in parallel), which cost 4 cycles each. In total, we thus have a cost of:</p>



<p class="eplus-UJKQje">200 cycles (global memory) + 8*34 cycles (shared memory) + 8*4 cycles (FFMA) = 504 cycles</p>



<p class="eplus-G6zbco">Let&#8217;s look at the cycle cost of using Tensor Cores.</p>



<h4 class="eplus-dS0Vu0">Matrix multiplication with Tensor Cores</h4>



<p class="eplus-8bWcQj">With Tensor Cores, we can perform a 4&#215;4 matrix multiplication in one cycle. To do that, we first need to get memory into the Tensor Core. Similarly to the above, we need to read from global memory (200 cycles) and store in shared memory. To do a 32&#215;32 matrix multiply, we need to do 8&#215;8=64 Tensor Core operations. A single SM has 8 Tensor Cores. So with 8 SMs, we have 64 Tensor Cores — just the number that we need! We can transfer the data from shared memory to the Tensor Cores with one memory transfer (34 cycles) and then do those 64 parallel Tensor Core operations (1 cycle). This means the total cost for Tensor Core matrix multiplication, in this case, is:</p>



<p class="eplus-ZzxYmo">200 cycles (global memory) + 34 cycles (shared memory) + 1 cycle (Tensor Core) = 235 cycles.</p>



<p class="eplus-KVQYN0">Thus we reduce the matrix multiplication cost significantly from 504 cycles to 235 cycles via Tensor Cores. In this simplified case, the Tensor Cores reduced the cost of both shared memory access and FFMA operations.&nbsp;</p>
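<p>Using the latency numbers quoted above, the two cycle counts can be checked with a few lines of Python. This is just a sketch of the back-of-the-envelope arithmetic, not a performance model:</p>

```python
# Cycle-cost comparison for the simplified 32x32 matmul examples above.
# Latencies are the Ampere figures quoted earlier in this post.
GLOBAL_MEM = 200   # global -> shared memory load
SHARED_MEM = 34    # one shared-memory access
FFMA = 4           # one fused multiply-add round
TENSOR_CORE = 1    # one Tensor Core matmul

# Without Tensor Cores: 1 global load + 8 shared loads + 8 FFMA rounds
no_tc = GLOBAL_MEM + 8 * SHARED_MEM + 8 * FFMA
# With Tensor Cores: 1 global load + 1 shared load + 1 Tensor Core op
with_tc = GLOBAL_MEM + SHARED_MEM + TENSOR_CORE

print(no_tc)    # 504
print(with_tc)  # 235
print(f"speedup: {no_tc / with_tc:.2f}x")  # speedup: 2.14x
```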



<p class="eplus-KVQYN0">This example is simplified: usually, each thread needs to calculate which memory address to read and write as you transfer data from global memory to shared memory. With the new Hopper (H100) architecture, we additionally have the Tensor Memory Accelerator (TMA) compute these indices in hardware, which helps each thread focus on more computation rather than computing indices.</p>



<h4>Matrix multiplication with Tensor Cores and Asynchronous copies (RTX 30/RTX 40) and TMA (H100)</h4>



<p>The RTX 30 Ampere and RTX 40 Ada series GPUs additionally have support for asynchronous transfers between global and shared memory. The H100 Hopper GPU extends this further by introducing the Tensor Memory Accelerator (TMA) unit. The TMA unit combines asynchronous copies with index calculation for reads and writes, so each thread no longer needs to calculate which element to read next and can focus on doing more matrix multiplication calculations. This looks as follows.</p>



<p>The TMA unit fetches memory from global to shared memory (200 cycles). Once the data arrives, the TMA unit fetches the next block of data asynchronously from global memory. While this is happening, the threads load data from shared memory and perform the matrix multiplication via the tensor core. Once the threads are finished they wait for the TMA unit to finish the next data transfer, and the sequence repeats. </p>



<p>As such, due to the asynchronous nature, the second global memory read by the TMA unit is already progressing as the threads process the current shared memory tile.  This means, the second read takes only 200 &#8211; 34 &#8211; 1 = 165 cycles. </p>



<p>Since we do many reads, only the first memory access will be slow and all other memory accesses will be partially overlapped with the TMA unit. Thus on average, we reduce the time by 35 cycles.</p>



<p>165 cycles (wait for async copy to finish) + 34 cycles (shared memory) + 1 cycle (Tensor Core) = 200 cycles.</p>



<p>This reduces the matrix multiplication time by another ~15%.</p>
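<p>The same arithmetic for the asynchronous case, as a quick sanity check (again a sketch, not a performance model):</p>

```python
# Steady-state cycle cost with asynchronous copies (RTX 30/40) or TMA (H100):
# the next global-memory read overlaps with compute on the current tile.
GLOBAL_MEM, SHARED_MEM, TENSOR_CORE = 200, 34, 1

wait = GLOBAL_MEM - SHARED_MEM - TENSOR_CORE        # 165 cycles of actual waiting
async_total = wait + SHARED_MEM + TENSOR_CORE       # 200 cycles per tile
sync_total = GLOBAL_MEM + SHARED_MEM + TENSOR_CORE  # 235 cycles, from before

print(async_total)  # 200
print(f"{(sync_total - async_total) / sync_total:.0%} fewer cycles")  # 15% fewer cycles
```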



<p class="eplus-2785xF">From these examples, it becomes clear why the next attribute, memory bandwidth, is so crucial for Tensor-Core-equipped GPUs. Since global memory is by far the largest cycle cost for matrix multiplication with Tensor Cores, we would have even faster GPUs if the global memory latency could be reduced. We can do this by either increasing the clock frequency of the memory (more cycles per second, but also more heat and higher energy requirements) or by increasing the number of elements that can be transferred at any one time (bus width).</p>




<h3 class="eplus-7uVl7z">Memory Bandwidth</h3>



<p class="eplus-Sc9M0X">From the previous section, we have seen that Tensor Cores are very fast. So fast, in fact, that they are idle most of the time as they are waiting for memory to arrive from global memory. For example, during GPT-3-sized training, which uses huge matrices — the larger, the better for Tensor Cores — we have a Tensor Core TFLOPS utilization of about 45-65%, meaning that even for the large neural networks about 50% of the time, Tensor Cores are idle.</p>



<p class="eplus-RsGhaC">This means that when comparing two GPUs with Tensor Cores, one of the single best indicators of each GPU’s performance is its memory bandwidth. For example, the A100 GPU has 1,555 GB/s memory bandwidth vs the 900 GB/s of the V100. As such, a basic estimate of the speedup of an A100 vs V100 is 1555/900 = 1.73x.</p>
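<p>This first-order estimate is simple enough to write down directly. The bandwidth figures are the ones quoted above:</p>

```python
# First-order speedup estimate from memory bandwidth alone (GB/s).
bandwidth_gb_s = {"V100": 900, "A100": 1555}
speedup = bandwidth_gb_s["A100"] / bandwidth_gb_s["V100"]
print(f"estimated speedup: {speedup:.2f}x")  # estimated speedup: 1.73x
```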



<h3 class="eplus-Nd29of">L2 Cache / Shared Memory / L1 Cache / Registers</h3>



<p class="eplus-Qs2qZL">Since memory transfers to the Tensor Cores are the limiting factor in performance, we are looking for other GPU attributes that enable faster memory transfer to Tensor Cores. L2 cache, shared memory, L1 cache, and the number of registers used are all related. To understand how a memory hierarchy enables faster memory transfers, it helps to understand how matrix multiplication is performed on a GPU.</p>



<p class="eplus-ecxtvX">To perform matrix multiplication, we exploit the memory hierarchy of a GPU that goes from slow global memory, to faster L2 memory, to fast local shared memory, to lightning-fast registers. However, the faster the memory, the smaller it is. </p>



<p>While logically, L2 and L1 memory are the same, L2 cache is larger and thus the average physical distance that needs to be traversed to retrieve a cache line is larger. You can see the L1 and L2 caches as organized warehouses where you want to retrieve an item. You know where the item is, but to go there takes on average much longer for the larger warehouse. This is the essential difference between L1 and L2 caches. Large = slow, small = fast.</p>



<p class="eplus-ecxtvX">For matrix multiplication, we can use this hierarchical separation into smaller and smaller, and thus faster and faster, chunks of memory to perform very fast matrix multiplications. For that, we need to chunk the big matrix multiplication into smaller sub-matrix multiplications. These chunks are called memory tiles, or often for short just tiles.</p>



<p class="eplus-ecxtvX">We perform matrix multiplication across these smaller tiles in local shared memory that is fast and close to the streaming multiprocessor (SM) — the equivalent of a CPU core. With Tensor Cores, we go a step further: We take each tile and load a part of these tiles into Tensor Cores which is directly addressed by registers. A matrix memory tile in L2 cache is 3-5x faster than global GPU memory (GPU RAM), shared memory is ~7-10x faster than the global GPU memory, whereas the Tensor Cores’ registers are ~200x faster than the global GPU memory.&nbsp;</p>



<p class="eplus-ifo0Kt">Having larger tiles means we can reuse more memory. I wrote about this in detail in my <a href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/">TPU vs GPU</a> blog post. In fact, you can see TPUs as having very, very, large tiles for each Tensor Core. As such, TPUs can reuse much more memory with each transfer from global memory, which makes them a little bit more efficient at matrix multiplications than GPUs.</p>



<p class="eplus-h6CECa">Each tile size is determined by how much memory we have per streaming multiprocessor (SM) and how much L2 cache we have across all SMs. We have the following shared memory sizes on the following architectures:</p>



<ul class="eplus-CmF7pW"><li>Volta (Titan V): 128kb shared memory / 6 MB L2</li><li>Turing (RTX 20s series): 96 kb shared memory / 5.5 MB L2</li><li>Ampere (RTX 30s series): 128 kb shared memory / 6 MB L2</li><li>Ada (RTX 40s series): 128 kb shared memory / 72 MB L2</li></ul>



<p class="eplus-mGR5cO">We see that Ada has a much larger L2 cache, allowing for larger tile sizes, which reduces global memory access. For example, for BERT large during training, the input and weight matrix of any matrix multiplication fit neatly into the L2 cache of Ada (but not of the other architectures). As such, data needs to be loaded from global memory only once, and is then available through the L2 cache, making matrix multiplication about 1.5&#8211;2.0x faster on Ada. For larger models, the speedups during training are lower, but certain sweet spots exist which may make certain models much faster. Inference with a batch size larger than 8 can also benefit immensely from the larger L2 cache.</p>
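<p>To see why a BERT-large matmul fits, here is a rough working-set calculation. The shapes are assumptions for illustration (hidden size 1024, FFN size 4096, batch 32, sequence length 512, FP16), not measurements:</p>

```python
# Rough check: do the input and weight matrices of one BERT-large-sized
# FFN matmul fit into L2 cache? All shapes are assumed for illustration.
MB = 2**20
BYTES_FP16 = 2

tokens = 32 * 512                              # batch * sequence length
activations = tokens * 1024 * BYTES_FP16 / MB  # input matrix, ~32 MB
weights = 1024 * 4096 * BYTES_FP16 / MB        # weight matrix, ~8 MB
working_set = activations + weights

print(f"{working_set:.0f} MB")                  # 40 MB
print("fits Ada 72 MB L2:", working_set <= 72)  # True
print("fits Ampere 6 MB L2:", working_set <= 6) # False
```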



<h2 class="eplus-Uq5SHY">Estimating Ada / Hopper Deep Learning Performance</h2>



<p class="eplus-gsZ188">This section is for those who want to understand the more technical details of how I derive the performance estimates for Ada and Hopper GPUs. If you do not care about these technical aspects, it is safe to skip this section.</p>



<h3 class="eplus-t3NVr4">Practical Ada / Hopper Speed Estimates</h3>



<p class="eplus-H6ehT0">Suppose we have an estimate for one GPU of a GPU architecture like Hopper, Ada, Ampere, Turing, or Volta. It is easy to extrapolate these results to other GPUs from the same architecture/series. Luckily, NVIDIA already <a href="https://developer.nvidia.com/deep-learning-performance-training-inference">benchmarked the A100 vs V100 vs H100</a> across a wide range of computer vision and natural language understanding tasks. Unfortunately, NVIDIA made sure that these numbers are not directly comparable by using different batch sizes and numbers of GPUs whenever possible to favor results for the H100 GPU. So in a sense, the benchmark numbers are partially honest, partially marketing numbers. In general, you could argue that using larger batch sizes is fair, as the H100/A100 GPU has more memory. Still, to compare GPU architectures, we should evaluate unbiased memory performance with the same batch size.</p>



<p class="eplus-3xKGXX">To get an unbiased estimate, we can scale the data center GPU results in two ways: (1) account for the differences in batch size, (2) account for the differences in using 1 vs 8 GPUs. We are lucky that we can find such an estimate for both biases in the data that NVIDIA provides.&nbsp;</p>



<p class="eplus-LO0cd2">Doubling the batch size increases throughput in terms of images/s (CNNs) by 13.6%. I benchmarked the same problem for transformers on my RTX Titan and found, surprisingly, the very same result: 13.5% — it appears that this is a robust estimate.</p>



<p class="eplus-obwQUJ">As we parallelize networks across more and more GPUs, we lose performance due to some networking overhead. The A100 8x GPU system has better networking (NVLink 3.0) than the V100 8x GPU system (NVLink 2.0) — this is another confounding factor. Looking directly at the data from NVIDIA, we can find that for CNNs, a system with 8x A100 has a 5% lower overhead than a system of 8x V100. This means if going from 1x A100 to 8x A100 gives you a speedup of, say, 7.00x, then going from 1x V100 to 8x V100 only gives you a speedup of 6.67x.&nbsp; For transformers, the figure is 7%.&nbsp;</p>
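<p>To make the de-biasing concrete, here is a minimal sketch. The 7.00x vs 6.67x scaling figures are the ones above; the raw 8-GPU throughput ratio passed in is a made-up placeholder, not NVIDIA data:</p>

```python
# Remove the newer system's better multi-GPU scaling from a measured
# 8-GPU throughput ratio, to approximate a single-GPU comparison.
def single_gpu_speedup(eight_gpu_ratio, scaling_new=7.00, scaling_old=6.67):
    """Divide out the scaling advantage of the newer 8-GPU system."""
    return eight_gpu_ratio * (scaling_old / scaling_new)

# e.g. a hypothetical 1.80x measured at 8 GPUs shrinks to ~1.72x per GPU
print(f"{single_gpu_speedup(1.80):.2f}x")  # 1.72x
```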



<p class="eplus-M8lGMx">Using these figures, we can estimate the speedup for a few specific deep learning architectures from the direct data that NVIDIA provides. The Tesla A100 offers the following speedup over the Tesla V100:</p>



<ul class="eplus-2N544K"><li>SE-ResNeXt101: 1.43x</li><li>Masked-R-CNN: 1.47x</li><li>Transformer (12 layer, Machine Translation, WMT14 en-de): 1.70x</li></ul>



<p class="eplus-hLEvfM">Thus, the figures are a bit lower than the theoretical estimate for computer vision. This might be due to smaller tensor dimensions, overhead from operations that are needed to prepare the matrix multiplication like img2col or Fast Fourier Transform (FFT), or operations that cannot saturate the GPU (final layers are often relatively small). It could also be artifacts of the specific architectures (grouped convolution).</p>



<p class="eplus-K95gae">The practical transformer estimate is very close to the theoretical estimate. This is probably because algorithms for huge matrices are very straightforward. I will use these practical estimates to calculate the cost efficiency of GPUs.</p>



<h3 class="eplus-CUQjZa">Possible Biases in Estimates</h3>



<p class="eplus-UQWjwg">The estimates above are for H100, A100, and V100 GPUs. In the past, NVIDIA sneaked unannounced performance degradations into the “gaming” RTX GPUs: (1) decreased Tensor Core utilization, (2) gaming fans for cooling, (3) disabled peer-to-peer GPU transfers. It might be possible that there are unannounced performance degradations in the RTX 40 series compared to the full Hopper H100. </p>



<p class="eplus-IQX3ZD">As of now, one of these degradations was found for Ampere GPUs: Tensor Core performance was decreased so that RTX 30 series GPUs are not as good as Quadro cards for deep learning purposes. This was also done for the RTX 20 series, so it is nothing new, but this time it was also done for the Titan equivalent card, the RTX 3090. The RTX Titan did not have performance degradation enabled.</p>



<p>Currently, no degradations for Ada GPUs are known, but I will update this post with any news on this and let my followers on <a href="https://twitter.com/Tim_Dettmers">Twitter</a> know.</p>



<h2 class="eplus-hICahu"><strong>Advantages and Problems for RTX40 and RTX 30 Series</strong></h2>



<p class="eplus-0zpF2c">The new NVIDIA Ampere RTX 30 series has additional benefits over the NVIDIA Turing RTX 20 series, such as sparse network training and inference. Other features, such as the new data types, should be seen more as an ease-of-use feature, as they provide the same performance boost as Turing but without any extra programming required. </p>



<p>The Ada RTX 40 series has even further advances, such as 8-bit Float (FP8) Tensor Cores. The RTX 40 series also has similar power and temperature issues compared to the RTX 30. The issue of melting power connector cables on the RTX 40 can easily be prevented by connecting the power cable correctly.</p>



<h3 class="eplus-2ZdClk">Sparse Network Training</h3>



<p class="eplus-vRUUJh">Ampere allows for fine-grained structured automatic sparse matrix multiplication at dense speeds. How does this work? Take a weight matrix and slice it into pieces of 4 elements. Now imagine 2 of these 4 elements being zero. Figure 1 shows what this could look like. </p>


<div class="wp-block-image eplus-IiM2y7">
<figure class="aligncenter size-large is-resized"><img data-attachment-id="935" data-permalink="https://timdettmers.com/sparse_matrix_ampere/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matrix_ampere.png?fit=321%2C387&amp;ssl=1" data-orig-size="321,387" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Ampere Sparse Matrix Multiplication Tensor Cores Matrix" data-image-description="&lt;p&gt;Figure X: Structure supported by the sparse matrix multiplication feature in Ampere GPUs. Figure is taken from Jeff Pool&#8217;s GTC 2020 presentation on  &lt;a href=&quot;https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s22085-accelerating-sparsity-in-the-nvidia-ampere-architecture%E2%80%8B.pdf&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;Accelerating Sparsity in the NVIDIA Ampere Architecture&lt;/a&gt; by the courtesy of NVIDIA.&lt;/p&gt;
" data-image-caption="&lt;p&gt;Figure X: Structure supported by the sparse matrix multiplication feature in Ampere GPUs. Figure is taken from Jeff Pool&#8217;s GTC 2020 presentation on  &lt;a href=&quot;https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s22085-accelerating-sparsity-in-the-nvidia-ampere-architecture%E2%80%8B.pdf&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;Accelerating Sparsity in the NVIDIA Ampere Architecture&lt;/a&gt; by the courtesy of NVIDIA.&lt;/p&gt;
" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matrix_ampere.png?fit=249%2C300&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matrix_ampere.png?fit=321%2C387&amp;ssl=1" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matrix_ampere.png?resize=258%2C311&#038;ssl=1" alt="Figure 1: Structure supported by the sparse matrix multiplication feature in Ampere GPUs. The figure is taken from Jeff Pool's GTC 2020 presentation on  Accelerating Sparsity in the NVIDIA Ampere Architecture by the courtesy of NVIDIA." class="wp-image-935" width="258" height="311" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matrix_ampere.png?w=321&amp;ssl=1 321w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matrix_ampere.png?resize=249%2C300&amp;ssl=1 249w" sizes="(max-width: 258px) 100vw, 258px" data-recalc-dims="1" /><figcaption>Figure 1: Structure supported by the sparse matrix multiplication feature in Ampere GPUs. The figure is taken from Jeff Pool&#8217;s GTC 2020 presentation on <a href="https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s22085-accelerating-sparsity-in-the-nvidia-ampere-architecture%E2%80%8B.pdf" rel="noreferrer noopener" target="_blank">Accelerating Sparsity in the NVIDIA Ampere Architecture</a> by the courtesy of NVIDIA.</figcaption></figure></div>


<p class="eplus-JGsMdm">When you multiply this sparse weight matrix with some dense inputs, the sparse matrix tensor core feature in Ampere automatically compresses the sparse matrix to a dense representation that is half the size as can be seen in Figure 2. After this compression, the densely compressed matrix tile is fed into the tensor core which computes a matrix multiplication of twice the usual size. This effectively yields a 2x speedup since the bandwidth requirements during matrix multiplication from shared memory are halved.</p>



<figure class="wp-block-image size-large eplus-O3WVqX"><img data-attachment-id="934" data-permalink="https://timdettmers.com/sparse_matmul/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?fit=1055%2C638&amp;ssl=1" data-orig-size="1055,638" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Sparse Matrix Multiplication in Ampere" data-image-description="&lt;p&gt;Figure X: The sparse matrix is compressed to a dense representation before the matrix multiplication is performed. Figure is taken from Jeff Pool&#8217;s GTC 2020 presentation on  &lt;a href=&quot;https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s22085-accelerating-sparsity-in-the-nvidia-ampere-architecture%E2%80%8B.pdf&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;Accelerating Sparsity in the NVIDIA Ampere Architecture&lt;/a&gt; by the courtesy of NVIDIA.&lt;/p&gt;
" data-image-caption="&lt;p&gt;Figure X: The sparse matrix is compressed to a dense representation before the matrix multiplication is performed. Figure is taken from Jeff Pool&#8217;s GTC 2020 presentation on  &lt;a href=&quot;https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s22085-accelerating-sparsity-in-the-nvidia-ampere-architecture%E2%80%8B.pdf&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;Accelerating Sparsity in the NVIDIA Ampere Architecture&lt;/a&gt; by the courtesy of NVIDIA.&lt;/p&gt;
" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?fit=300%2C181&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?fit=1024%2C619&amp;ssl=1" width="1024" height="619" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?resize=1024%2C619&#038;ssl=1" alt="Figure 2: The sparse matrix is compressed to a dense representation before the matrix multiplication is performed. " class="wp-image-934" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?resize=1024%2C619&amp;ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?resize=300%2C181&amp;ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?resize=768%2C464&amp;ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?w=1055&amp;ssl=1 1055w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" /><figcaption>Figure 2: The sparse matrix is compressed to a dense representation before the matrix multiplication is performed. The figure is taken from Jeff Pool&#8217;s GTC 2020 presentation on  <a href="https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s22085-accelerating-sparsity-in-the-nvidia-ampere-architecture%E2%80%8B.pdf" rel="noopener noreferrer" target="_blank">Accelerating Sparsity in the NVIDIA Ampere Architecture</a> by the courtesy of NVIDIA.</figcaption></figure>



<p class="eplus-dCX1Ah">I was working on <a href="https://arxiv.org/abs/1907.04840">sparse network training</a> in my research and I also wrote a <a href="https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/">blog post about sparse training</a>. One criticism of my work was that “You reduce the FLOPS required for the network, but it does not yield speedups because GPUs cannot do fast sparse matrix multiplication.” Well, with the addition of the sparse matrix multiplication feature for Tensor Cores, my algorithm, or <a href="https://arxiv.org/abs/2002.03231" rel="nofollow">other</a> <a href="https://arxiv.org/abs/2002.07376" rel="nofollow">sparse</a> <a href="https://arxiv.org/abs/1911.11134" rel="nofollow">training</a> <a href="https://arxiv.org/abs/1902.05967" rel="nofollow">algorithms</a>, now actually provide speedups of up to 2x during training.</p>



<figure class="wp-block-image size-large eplus-uspqxA"><a href="https://arxiv.org/abs/1907.04840"><img data-attachment-id="779" data-permalink="https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/sparse_momentum/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?fit=1096%2C528&amp;ssl=1" data-orig-size="1096,528" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Sparse Momentum Dettmers &#038; Zettlemoyer 2019" data-image-description="&lt;p&gt;Figure X: The sparse training algorithm developed has three stages: (1) Determine the importance of each layer. (2) Remove the smallest, unimportant weights. (3) Grow new weights proportional to the importance of each layers.&lt;/p&gt;
" data-image-caption="&lt;p&gt;Figure X: The sparse training algorithm developed has three stages: (1) Determine the importance of each layer. (2) Remove the smallest, unimportant weights. (3) Grow new weights proportional to the importance of each layers.&lt;/p&gt;
" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?fit=300%2C145&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?fit=1024%2C493&amp;ssl=1" width="1024" height="493" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=1024%2C493&#038;ssl=1" alt="Figure 3: The sparse training algorithm that I developed has three stages: (1) Determine the importance of each layer. (2) Remove the smallest, unimportant weights. (3) Grow new weights proportional to the importance of each layer. Read more about my work in my sparse training blog post." class="wp-image-779" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=1024%2C493&amp;ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=300%2C145&amp;ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=768%2C370&amp;ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?w=1096&amp;ssl=1 1096w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" /></a><figcaption>Figure 3: The <a href="https://arxiv.org/abs/1907.04840">sparse training algorithm</a> that I developed has three stages: (1) Determine the importance of each layer. (2) Remove the smallest, unimportant weights. (3) Grow new weights proportional to the importance of each layer. Read more about my work in my <a href="https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/">sparse training blog post</a>.</figcaption></figure>



<p class="eplus-9YQAyM">While this feature is still experimental and sparse network training is not commonplace yet, having it on your GPU means you are ready for the future of sparse training.</p>



<h3 class="eplus-uvNoM4">Low-precision Computation</h3>



<p class="eplus-AgG4J1">In my work, I’ve previously shown that new data types can improve stability during <a href="https://arxiv.org/abs/1511.04561">low-precision backpropagation</a>. </p>



<figure class="wp-block-image size-large eplus-4aBkKy"><img data-attachment-id="941" data-permalink="https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/8-bit_data_types/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?fit=869%2C268&amp;ssl=1" data-orig-size="869,268" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="8-bit_data_types" data-image-description="&lt;p&gt;Figure X: Low-precision deep learning 8-bit datatypes that I developed. Deep learning training benefits from highly specialized data types. My dynamic tree datatype uses a dynamic bit that indicates the beginning of a binary bisection tree that quantized the range [0, 0.9] while all previous bits are used for the exponent. This allows to dynamically represent large numbers and small numbers with high precision.&lt;/p&gt;
" data-image-caption="&lt;p&gt;Figure X: Low-precision deep learning 8-bit datatypes that I developed. Deep learning training benefits from highly specialized data types. My dynamic tree datatype uses a dynamic bit that indicates the beginning of a binary bisection tree that quantized the range [0, 0.9] while all previous bits are used for the exponent. This allows to dynamically represent large numbers and small numbers with high precision.&lt;/p&gt;
" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?fit=300%2C93&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?fit=869%2C268&amp;ssl=1" width="869" height="268" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?resize=869%2C268&#038;ssl=1" alt="Figure 4: Low-precision deep learning 8-bit datatypes that I developed. Deep learning training benefits from highly specialized data types. My dynamic tree datatype uses a dynamic bit that indicates the beginning of a binary bisection tree that quantized the range [0, 0.9] while all previous bits are used for the exponent. This allows to dynamically represent numbers that are both large and small with high precision." class="wp-image-941" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?w=869&amp;ssl=1 869w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?resize=300%2C93&amp;ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?resize=768%2C237&amp;ssl=1 768w" sizes="(max-width: 869px) 100vw, 869px" data-recalc-dims="1" /><figcaption>Figure 4: Low-precision deep learning 8-bit datatypes that I developed. Deep learning training benefits from highly specialized data types. My dynamic tree datatype uses a dynamic bit that indicates the beginning of a binary bisection tree that quantized the range [0, 0.9] while all previous bits are used for the exponent. This allows to dynamically represent numbers that are both large and small with high precision.</figcaption></figure>



<p class="eplus-3DDVFs">Currently, if you want stable backpropagation with 16-bit floating-point numbers (FP16), the big problem is that the ordinary FP16 data type only supports numbers in the range [-65,504, 65,504]. If a gradient slips past this range, it explodes into NaN values, while very small gradients underflow to zero and are lost. To prevent this during FP16 training, we usually perform loss scaling, where the loss is multiplied by a scaling factor before backpropagation (and the gradients are unscaled afterwards) so that gradient values stay within the representable range.</p>
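The limited FP16 range, and the loss-scaling fix, can be simulated with Python's `struct` module, which supports the IEEE half-precision format. This is a toy model of dynamic loss scaling, not a real AMP implementation:

```python
import struct

FP16_MAX = 65504.0

def to_fp16(x):
    """Round-trip a Python float through IEEE FP16 ('e' format)."""
    try:
        return struct.unpack('e', struct.pack('e', x))[0]
    except OverflowError:
        return float('inf') if x > 0 else float('-inf')

assert to_fp16(FP16_MAX) == 65504.0
assert to_fp16(1e5) == float('inf')   # overflow: NaN/Inf during training
assert to_fp16(1e-8) == 0.0           # underflow: the gradient is lost

def scaled_grad(grad, scale):
    """Toy dynamic loss scaling: back off the scale on overflow,
    then return the unscaled FP16 gradient and the final scale."""
    while to_fp16(grad * scale) == float('inf'):
        scale /= 2
    return to_fp16(grad * scale) / scale, scale

g, s = scaled_grad(1e-8, 2 ** 16)
assert g > 0.0   # the tiny gradient survives thanks to scaling
```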



<p class="eplus-WQx9qD">The BrainFloat 16 format (BF16) uses more bits for the exponent, so its range of representable numbers is the same as for FP32: [-3*10^38, 3*10^38]. BF16 has less precision, that is, fewer significant digits, but gradient precision is not that important for learning. With BF16, you no longer need to do any loss scaling or worry about gradients blowing up quickly. As such, we should see an increase in training stability when using the BF16 format, at the cost of a slight loss of precision.</p>
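The range difference comes directly from the bit layout: FP16 spends 5 bits on the exponent and 10 on the mantissa, while BF16 spends 8 exponent bits (the same as FP32) and only 7 mantissa bits. A few lines of Python confirm the numbers quoted above:

```python
def float_max(exp_bits, mantissa_bits):
    """Largest finite value of a binary float format under the usual
    IEEE conventions (all-ones exponent reserved for Inf/NaN)."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = 2 ** exp_bits - 2 - bias        # largest non-reserved exponent
    max_mantissa = 2 - 2 ** -mantissa_bits    # 1.111...1 in binary
    return max_mantissa * 2.0 ** max_exp

fp16_max = float_max(exp_bits=5, mantissa_bits=10)
bf16_max = float_max(exp_bits=8, mantissa_bits=7)

assert fp16_max == 65504.0        # the FP16 limit mentioned above
assert 3e38 < bf16_max < 4e38     # roughly the same range as FP32
```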



<p class="eplus-NLJqlG">What this means for you: With BF16 precision, training might be more stable than with FP16 precision while providing the same speedups. With TensorFloat-32 (TF32) precision, you get near-FP32 stability with speedups close to FP16. The good thing is that to use these data types, you can just replace FP32 with TF32 and FP16 with BF16 — no code changes required!</p>
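On the framework side, PyTorch exposes these choices as a flag and an autocast dtype. This is a sketch; exact defaults vary across PyTorch versions, and BF16 autocast requires Ampere or newer hardware:

```python
import torch

# TF32: matrix multiplications and convolutions run on Tensor Cores
# with TF32 precision instead of full FP32
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# BF16: swap the FP16 autocast dtype for BF16 — no gradient scaler
# needed, since BF16 covers the same range as FP32
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    pass  # the forward pass of your model goes here
```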



<p class="eplus-S1SD30">Overall, though, these new data types can be seen as lazy data types in the sense that you could have gotten all their benefits with the old data types and some additional programming effort (proper loss scaling, initialization, normalization, using Apex). As such, these data types do not provide speedups but rather improve the ease of use of low precision for training.</p>




<h3 class="eplus-UXUbWi">Fan Designs and GPU Temperature Issues</h3>



<p class="eplus-yw9sMm">While the new fan design of the RTX 30 series performs very well at cooling the GPU, the different fan designs of non-founders-edition GPUs might be more problematic. If your GPU heats up beyond 80C, it will throttle itself and slow down its computational speed / power. This overheating can happen in particular if you stack multiple GPUs next to each other. A solution is to use PCIe extenders to create space between GPUs.</p>



<p class="eplus-CTvlva">Spreading GPUs with PCIe extenders is very effective for cooling, and other fellow PhD students at the University of Washington and I use this setup with great success. It does not look pretty, but it keeps your GPUs cool! This rig has been running with no problems at all for 4 years now. Extenders can also help if you do not have enough space to fit all GPUs in the PCIe slots. For example, if you can find the space within a desktop computer case, it might be possible to buy standard 3-slot-width RTX 4090s and spread them with PCIe extenders within the case. With this, you might solve both the space issue and the cooling issue of a 4x RTX 4090 setup with a single simple solution.</p>


<div class="wp-block-image eplus-dDY0Kb">
<figure class="aligncenter size-large"><img data-attachment-id="861" data-permalink="https://timdettmers.com/4x_rtx2080ti_desktop_extenders/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?fit=1920%2C2560&amp;ssl=1" data-orig-size="1920,2560" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;1.9&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;Redmi Note 5&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1557156443&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;3.94&quot;,&quot;iso&quot;:&quot;1250&quot;,&quot;shutter_speed&quot;:&quot;0.05&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;1&quot;}" data-image-title="4x_RTX2080Ti_desktop_extenders" data-image-description="&lt;p&gt;4x GPUs with PCIe extenders&lt;/p&gt;
" data-image-caption="&lt;p&gt;4x GPUs with PCIe extenders&lt;/p&gt;
" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?fit=225%2C300&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?fit=768%2C1024&amp;ssl=1" width="768" height="1024" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders.jpg?resize=768%2C1024&#038;ssl=1" alt="Figure 5: 4x GPUs with PCIe extenders. It looks like a mess, but it is very effective for cooling. I used this rig for 2 years and cooling is excellent despite problematic RTX 2080 Ti Founders Edition GPUs." class="wp-image-861" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?resize=768%2C1024&amp;ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?resize=225%2C300&amp;ssl=1 225w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?resize=1152%2C1536&amp;ssl=1 1152w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?resize=1536%2C2048&amp;ssl=1 1536w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?w=1920&amp;ssl=1 1920w" sizes="(max-width: 768px) 100vw, 768px" data-recalc-dims="1" /><figcaption>Figure 5: 4x GPUs with PCIe extenders. It looks like a mess, but it is very effective for cooling. I used this rig for 4 years and cooling is excellent despite problematic RTX 2080 Ti Founders Edition GPUs.</figcaption></figure></div>


<h3 class="eplus-JbBDhp">3-slot Design and Power Issues</h3>



<p class="eplus-DovhM1">The RTX 3090 and RTX 4090 are 3-slot GPUs, so you will not be able to use them in a 4x setup with the default fan design from NVIDIA. This is somewhat justified because they run at over 350W TDP and would be difficult to cool in a multi-GPU 2-slot setting. The RTX 3080 is only slightly better at 320W TDP, and cooling a 4x RTX 3080 setup will also be very difficult.</p>



<p class="eplus-pfi9QE">It is also difficult to power a 4x 350W = 1400W or 4x 450W = 1800W system in the 4x RTX 3090 or 4x RTX 4090 case. Power supply units (PSUs) of 1600W are readily available, but having only 200W left to power the <a href="https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/">CPU and motherboard</a> can be too tight. The components’ maximum power is only drawn if the components are fully utilized, and in deep learning, the CPU is usually only under weak load. With that, a 1600W PSU might work quite well for a 4x RTX 3080 build, but for a 4x RTX 3090 build, it is better to look for high-wattage PSUs (1700W+). Some of my followers have had great success with cryptomining PSUs — have a look in the comment section for more info about that. Otherwise, it is important to note that not all outlets support PSUs above 1600W, especially in the US. This is the reason why there are currently few standard desktop PSUs above 1600W on the US market. If you get a server or cryptomining PSU, beware of the form factor — make sure it fits into your computer case.</p>
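The arithmetic behind these PSU recommendations is simple enough to put into a small helper (TDP numbers as cited above; the 200W CPU-and-motherboard budget follows the estimate in the text):

```python
def psu_headroom(gpu_tdp_w, n_gpus, psu_w, cpu_and_board_w=200):
    """Watts left over after GPUs, CPU, and motherboard at full load."""
    return psu_w - n_gpus * gpu_tdp_w - cpu_and_board_w

# 4x RTX 3080 (320W) on a 1600W PSU: tight but workable
assert psu_headroom(320, 4, 1600) == 120
# 4x RTX 3090 (350W) on a 1600W PSU: zero headroom at full load
assert psu_headroom(350, 4, 1600) == 0
# 4x RTX 4090 (450W) needs well above 1600W
assert psu_headroom(450, 4, 1600) < 0
```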



<h3 class="eplus-wvweIn">Power Limiting: An Elegant Solution to Solve the Power Problem?</h3>



<p class="eplus-eq3oCc">It is possible to set a power limit on your GPUs. So you would be able to programmatically set the power limit of an RTX 3090 to 300W instead of its standard 350W. In a 4x GPU system, that is a saving of 200W, which might just be enough to make a 4x RTX 3090 system with a 1600W PSU feasible. It also helps to keep the GPUs cool. So setting a power limit can solve the two major problems of a 4x RTX 3080 or 4x RTX 3090 setup, cooling and power, at the same time. For a 4x setup, you still need effective blower GPUs (and the standard design may prove adequate for this), but this resolves the PSU problem.</p>
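Power limits are set through nvidia-smi. A small Python wrapper shows the command; actually applying it requires root privileges and an NVIDIA driver, so this sketch only constructs the command:

```python
import subprocess

def power_limit_cmd(gpu_index, watts):
    """Build the nvidia-smi command that caps a GPU's power draw.
    Note: the limit resets on reboot unless persistence mode is on."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]

# Cap all four RTX 3090s at 300W instead of the stock 350W:
cmds = [power_limit_cmd(i, 300) for i in range(4)]
assert cmds[0] == ["nvidia-smi", "-i", "0", "-pl", "300"]

# To actually apply one of these (requires root):
# subprocess.run(["sudo"] + power_limit_cmd(0, 300), check=True)
```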



<figure class="wp-block-image size-large eplus-uDERbv"><img data-attachment-id="933" data-permalink="https://timdettmers.com/power_limit_nvidia_smi/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?fit=1187%2C1195&amp;ssl=1" data-orig-size="1187,1195" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Power Limit Cooling Effect NVIDIA SMI" data-image-description="&lt;p&gt;Figure X: Reducing the power limit has a slight cooling effect. Reducing the RTX 2080 Ti power limit by 50-60 W decreases temperatures slightly and fans run more silent.&lt;/p&gt;
" data-image-caption="&lt;p&gt;Figure X: Reducing the power limit has a slight cooling effect. Reducing the RTX 2080 Ti power limit by 50-60 W decreases temperatures slightly and fans run more silent.&lt;/p&gt;
" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?fit=298%2C300&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?fit=1017%2C1024&amp;ssl=1" width="1017" height="1024" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?resize=1017%2C1024&#038;ssl=1" alt="Figure 6: Reducing the power limit has a slight cooling effect. Reducing the RTX 2080 Ti power limit by 50-60 W decreases temperatures slightly and fans run more silent." class="wp-image-933" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?resize=1017%2C1024&amp;ssl=1 1017w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?resize=298%2C300&amp;ssl=1 298w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?resize=150%2C150&amp;ssl=1 150w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?resize=768%2C773&amp;ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?w=1187&amp;ssl=1 1187w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" /><figcaption>Figure 6: Reducing the power limit has a slight cooling effect. Reducing the RTX 2080 Ti power limit by 50-60 W decreases temperatures slightly and fans run more silent.</figcaption></figure>



<p class="eplus-dIRoAn">You might ask, “Doesn’t this slow down the GPU?” Yes, it does, but the question is by how much. I benchmarked the 4x RTX 2080 Ti system shown in Figure 5 under different power limits to test this. I benchmarked the time for 500 mini-batches of BERT Large during inference (excluding the softmax layer). I chose BERT Large inference since, in my experience, it is the deep learning model that stresses the GPU the most, so I would expect power limiting to produce its most severe slowdown for this model. As such, the slowdowns reported here are probably close to the maximum slowdowns you can expect. The results are shown in Figure 7. </p>



<figure class="wp-block-image size-large eplus-WIGOOF"><img data-attachment-id="939" data-permalink="https://timdettmers.com/rtx-2080-ti-slowdown-vs-power-limit/" data-orig-file="https://timdettmers.com/wp-content/uploads/2020/09/RTX-2080-Ti-Slowdown-vs-Power-Limit.svg" data-orig-size="853,703" data-comments-opened="1" data-image-meta="[]" data-image-title="RTX 2080 Ti Slowdown vs Power Limit" data-image-description="&lt;p&gt;Figure 6: Measured slowdown for a given power limit on an RTX 2080 Ti. Measurements taken are mean processing times for 500 mini-batches of BERT Large during inference (excluding softmax layer).&lt;/p&gt;
" data-image-caption="&lt;p&gt;Figure 6: Measured slowdown for a given power limit on an RTX 2080 Ti. Measurements taken are mean processing times for 500 mini-batches of BERT Large during inference (excluding softmax layer).&lt;/p&gt;
" data-medium-file="https://timdettmers.com/wp-content/uploads/2020/09/RTX-2080-Ti-Slowdown-vs-Power-Limit.svg" data-large-file="https://timdettmers.com/wp-content/uploads/2020/09/RTX-2080-Ti-Slowdown-vs-Power-Limit.svg" width="853" height="703" src="https://timdettmers.com/wp-content/uploads/2020/09/RTX-2080-Ti-Slowdown-vs-Power-Limit.svg" alt="Figure 7: Measured slowdown for a given power limit on an RTX 2080 Ti. Measurements taken are mean processing times for 500 mini-batches of BERT Large during inference (excluding softmax layer)." class="wp-image-939"/><figcaption>Figure 7: Measured slowdown for a given power limit on an RTX 2080 Ti. Measurements taken are mean processing times for 500 mini-batches of BERT Large during inference (excluding softmax layer).</figcaption></figure>



<p class="eplus-FATkvQ">As we can see, setting the power limit does not seriously affect performance. Limiting the power by 50W — more than enough to handle 4x RTX 3090 — decreases performance by only 7%.</p>
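To put the 7% figure in perspective: limiting a 350W GPU to 300W cuts power by about 14% while costing only 7% performance, so performance per watt actually improves. A rough calculation using the benchmark numbers above:

```python
def perf_per_watt_gain(tdp_w, limit_w, slowdown):
    """Relative change in performance per watt after power limiting."""
    perf = 1.0 - slowdown
    return (perf / limit_w) / (1.0 / tdp_w) - 1.0

gain = perf_per_watt_gain(tdp_w=350, limit_w=300, slowdown=0.07)
assert 0.08 < gain < 0.09   # roughly 8.5% better performance per watt
```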






<h3>RTX 4090s and Melting Power Connectors: How to Prevent Problems</h3>



<p>There was a misconception that RTX 4090 power cables melt because they were bent. However, it was found that only 0.1% of users had this problem, and that it occurred due to user error. Here is a video showing that the main problem is that <a href="https://www.youtube.com/watch?v=ig2px7ofKhQ&amp;t=1065s">cables were not inserted correctly</a>.</p>



<p>So using RTX 4090 cards is perfectly safe if you follow these installation instructions:</p>



<ol><li>If you use an old cable or old GPU, make sure the contacts are free of debris / dust.</li><li>Insert the power connector into the socket until you hear a *click* — this is the most important part.</li><li>Test for a good fit by wiggling the power cable left to right. The cable should not move.</li><li>Check the contact with the socket visually; there should be no gap between cable and socket.</li></ol>



<h3>8-bit Float Support in H100 and RTX 40 series GPUs</h3>



<p>The support of the 8-bit Float (FP8) data type is a huge advantage for the RTX 40 series and H100 GPUs. With 8-bit inputs, you can load the data for matrix multiplication twice as fast, and you can store twice as many matrix elements in your caches, which in the Ada and Hopper architectures are very large. With FP8 tensor cores, you get 0.66 PFLOPS of compute on an RTX 4090 — more FLOPS than the entire fastest supercomputer in the world in 2007. Four RTX 4090s with FP8 compute rival the fastest supercomputer in the world in 2010 (deep learning started to work just in 2009).</p>



<p>The main problem with using 8-bit precision is that transformers can get very unstable with so few bits and crash during training or generate nonsense during inference. I have written a <a href="https://arxiv.org/abs/2208.07339">paper about the emergence of instabilities in large language models</a>, and I have also written a more accessible <a href="https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/">blog post</a>.</p>



<p>The main takeaway is this: Using 8-bit instead of 16-bit makes things very unstable, but if you keep a couple of dimensions in high precision, everything works just fine.</p>



<figure class="wp-block-image size-large"><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/LLM_int8_zeroshot_emergence.png?ssl=1"><img data-attachment-id="1146" data-permalink="https://timdettmers.com/llm_int8_zeroshot_emergence/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/LLM_int8_zeroshot_emergence.png?fit=1808%2C1462&amp;ssl=1" data-orig-size="1808,1462" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="LLM_int8_zeroshot_emergence" data-image-description="" data-image-caption="&lt;p&gt;Main results from my work on 8-bit matrix multiplication for Large Language Models (LLMs). We can see that the best 8-bit baseline fails to deliver good zero-shot performance. The method that I developed, LLM.int8(), can perform Int8 matrix multiplication with the same results as the 16-bit baseline.&lt;/p&gt;
" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/LLM_int8_zeroshot_emergence.png?fit=300%2C243&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/LLM_int8_zeroshot_emergence.png?fit=1024%2C828&amp;ssl=1" width="1024" height="828" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/LLM_int8_zeroshot_emergence.png?resize=1024%2C828&#038;ssl=1" alt="" class="wp-image-1146" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/LLM_int8_zeroshot_emergence.png?resize=1024%2C828&amp;ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/LLM_int8_zeroshot_emergence.png?resize=300%2C243&amp;ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/LLM_int8_zeroshot_emergence.png?resize=768%2C621&amp;ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/LLM_int8_zeroshot_emergence.png?resize=1536%2C1242&amp;ssl=1 1536w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/LLM_int8_zeroshot_emergence.png?w=1808&amp;ssl=1 1808w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" /></a><figcaption>Main results from my work on 8-bit matrix multiplication for Large Language Models (LLMs). We can see that the best 8-bit baseline fails to deliver good zero-shot performance. The method that I developed, LLM.int8(), can perform Int8 matrix multiplication with the same results as the 16-bit baseline.</figcaption></figure>



<p>But Int8 was already supported by RTX 30 / A100 / Ampere-generation GPUs, so why is FP8 in the RTX 40 series another big upgrade? The FP8 data type is much more stable than the Int8 data type, and it is easy to use in functions like layer norm or non-linear functions, which are difficult to implement with integer data types. This will make it very straightforward to use in training and inference. I think this will make FP8 training and inference relatively common in a couple of months. </p>
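Part of FP8's advantage is its nonuniform spacing: like larger floats, it packs most of its representable values near zero, where weights and activations live, whereas Int8 spaces its 256 values uniformly. A small decoder for the E4M3 variant makes this concrete (this follows the published FP8 convention as I understand it — 4 exponent bits with bias 7, 3 mantissa bits, no infinities — so treat the special-value details as an assumption):

```python
def decode_e4m3(bits):
    """Decode an FP8 E4M3 byte: 1 sign, 4 exponent, 3 mantissa bits.
    Returns None for NaN (the all-ones pattern; E4M3 has no infinities)."""
    sign = -1.0 if bits & 0x80 else 1.0
    exp = (bits >> 3) & 0xF
    man = bits & 0x7
    if exp == 0xF and man == 0x7:
        return None
    if exp == 0:                                # subnormal numbers
        return sign * (man / 8) * 2.0 ** -6
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)

values = sorted(v for b in range(256) if (v := decode_e4m3(b)) is not None)
pos = [v for v in values if v > 0]

assert max(values) == 448.0            # largest representable magnitude
assert pos[0] == 2.0 ** -9             # smallest positive value
assert pos[1] - pos[0] == 2.0 ** -9    # spacing near zero: very fine
assert pos[-1] - pos[-2] == 32.0       # spacing near the top: coarse
```

Int8, by contrast, has a constant step size between all 256 values, which wastes resolution exactly where neural network values are densest.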



<p>If you want to read more about the advantages of Float vs Integer data types, you can read my recent paper about <a href="https://arxiv.org/abs/2212.09720">k-bit inference scaling laws</a>. Below you can see one relevant main result for Float vs Integer data types from this paper. We can see that, bit for bit, the FP4 data type preserves more information than the Int4 data type and thus improves the mean LLM zero-shot accuracy across 4 tasks.</p>



<figure class="wp-block-image size-large"><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?ssl=1"><img data-attachment-id="1151" data-permalink="https://timdettmers.com/pythia_4bit_datatypes2/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?fit=1159%2C875&amp;ssl=1" data-orig-size="1159,875" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="pythia_4bit_datatypes2" data-image-description="" data-image-caption="&lt;p&gt;4-bit Inference scaling laws for Pythia Large Language Models for different data types. We see that bit-by-bit, 4-bit float data types have better zeroshot accuracy compared to the Int4 data types.&lt;/p&gt;
" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?fit=300%2C226&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?fit=1024%2C773&amp;ssl=1" width="1024" height="773" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?resize=1024%2C773&#038;ssl=1" alt="" class="wp-image-1151" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?resize=1024%2C773&amp;ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?resize=300%2C226&amp;ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?resize=768%2C580&amp;ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?w=1159&amp;ssl=1 1159w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" /></a><figcaption>4-bit Inference scaling laws for Pythia Large Language Models for different data types. We see that bit-by-bit, 4-bit float data types have better zeroshot accuracy compared to the Int4 data types.</figcaption></figure>



<h2 class="eplus-A4G25G">Raw Performance Ranking of GPUs</h2>



<p class="eplus-xyHWc1">Below we see a chart of raw relative performance across all GPUs. We see that there is a gigantic gap between the 8-bit performance of H100 GPUs and that of older cards, which are optimized for 16-bit performance.</p>






<figure class="wp-block-image size-large"><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?ssl=1"><img data-attachment-id="1160" data-permalink="https://timdettmers.com/gpus_ada_raw_performance3/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?fit=1703%2C1673&amp;ssl=1" data-orig-size="1703,1673" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="GPUS_Ada_raw_performance3" data-image-description="" data-image-caption="&lt;p&gt;Shown is raw relative performance of GPUs. For example, an RTX 4090 has about 0.33x performance of a H100 SMX for 8-bit inference. In other words, a H100 SMX is three times faster for 8-bit inference compared to a RTX 4090.&lt;/p&gt;
" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?fit=300%2C295&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?fit=1024%2C1006&amp;ssl=1" width="1024" height="1006" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?resize=1024%2C1006&#038;ssl=1" alt="" class="wp-image-1160" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?resize=1024%2C1006&amp;ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?resize=300%2C295&amp;ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?resize=768%2C754&amp;ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?resize=1536%2C1509&amp;ssl=1 1536w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?w=1703&amp;ssl=1 1703w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" /></a><figcaption>Shown is raw relative transformer performance of GPUs. For example, an RTX 4090 has about 0.33x performance of a H100 SMX for 8-bit inference. In other words, a H100 SMX is three times faster for 8-bit inference compared to a RTX 4090.</figcaption></figure>



<p class="eplus-gzQ71E">For this data, I did not model 8-bit compute for older GPUs. I left it out because 8-bit inference and training are much more effective on Ada/Hopper GPUs thanks to the 8-bit float (FP8) data type and the Tensor Memory Accelerator (TMA), which saves the overhead of computing read/write indices. This is particularly helpful for 8-bit matrix multiplication, and FP8 support makes 8-bit training in particular much more effective.</p>



<p class="eplus-gzQ71E">I did not model numbers for 8-bit training because doing so requires knowing the latency of the L1 and L2 caches on Hopper/Ada GPUs, which is unknown, and I do not have access to such GPUs. On Hopper/Ada, 8-bit training performance could well be 3-4x the 16-bit training performance if the caches are as fast as rumored.</p>



<p class="eplus-gzQ71E">But even with the new FP8 tensor cores, there are additional issues that are difficult to take into account when modeling GPU performance. For example, FP8 tensor cores do not support transposed matrix multiplication, which means backpropagation needs either a separate transpose before multiplication, or one needs to hold two sets of weights — one transposed and one non-transposed — in memory. I used two sets of weights when I experimented with Int8 training in my <a href="https://arxiv.org/abs/2208.07339">LLM.int8()</a> project, and this reduced the overall speedups quite significantly. I think one can do better with the right algorithms/software, but this shows that missing features like transposed matrix multiplication for tensor cores can affect performance.</p>



<p class="eplus-gzQ71E">For old GPUs, Int8 inference performance is close to 16-bit inference performance for models below 13B parameters. Int8 performance on old GPUs is only relevant for relatively large models with 175B parameters or more. If you are interested in the 8-bit performance of older GPUs, you can read Appendix D of my <a href="https://arxiv.org/abs/2208.07339">LLM.int8() paper</a>, where I benchmark Int8 performance.</p>



<h2 class="eplus-1SzBFE">GPU Deep Learning Performance per Dollar</h2>



<p>Below we see the chart of performance per US dollar for all GPUs, sorted by 8-bit inference performance. Use it to find a suitable GPU as follows:</p>



<ol><li>Determine the amount of GPU memory that you need (rough heuristic: at least 12 GB for image generation; at least 24 GB for work with transformers).</li><li>While 8-bit inference and training are experimental, they will become standard within 6 months. You might need to do some extra, more difficult coding to work with 8-bit in the meantime. Is that OK for you? If not, select for 16-bit performance.</li><li>Using the metric determined in (2), find the GPU with the highest relative performance/dollar that has the amount of memory you need.</li></ol>
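<p>The three steps above can be sketched as a tiny script. Note that the GPU names, memory sizes, and relative performance-per-dollar numbers below are illustrative placeholders, not the actual values from the chart:</p>

```python
# Sketch of the selection procedure: filter by required memory,
# then maximize relative performance per dollar.
# All numbers below are illustrative, not taken from the chart.
gpus = [
    # (name, memory in GB, relative perf per dollar; higher is better)
    ("RTX 4070 Ti", 12, 1.00),
    ("RTX 3080",    10, 0.92),
    ("RTX 3090",    24, 0.55),
    ("RTX 4090",    24, 0.70),
]

def pick_gpu(gpus, min_memory_gb):
    """Return the highest perf/dollar GPU with enough memory."""
    candidates = [g for g in gpus if g[1] >= min_memory_gb]
    if not candidates:
        raise ValueError("no GPU satisfies the memory requirement")
    return max(candidates, key=lambda g: g[2])

# Step 1: transformer work -> at least 24 GB; step 3: best perf/dollar.
best = pick_gpu(gpus, min_memory_gb=24)
```

<p>With real numbers plugged in from the chart, the same filter-then-maximize logic applies.</p>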



<p>We can see that the RTX 4070 Ti is most cost-effective for 8-bit and 16-bit inference, while the RTX 3080 remains most cost-effective for 16-bit training. While these GPUs are most cost-effective, they are not necessarily recommended, as they do not have sufficient memory for many use-cases. However, they might be the ideal cards to get started on your deep learning journey. Many of these smaller GPUs are also excellent for Kaggle competitions, where one can often rely on smaller models and where how you work is more important than model size.</p>



<p>The best GPUs for academic and startup servers seem to be A6000 Ada GPUs (not to be confused with the A6000 Turing). The H100 SXM GPU is also very cost-effective, with high memory and very strong performance. If I were to build a small cluster for a company or academic lab, I would use 66-80% A6000 GPUs and 20-33% H100 SXM GPUs. If I got a good deal on L40 GPUs, I would pick them instead of the A6000, so you can always ask for a quote on these.</p>



<figure class="wp-block-image size-large"><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUs_Ada_performance_per_dollar6.png?ssl=1"><img data-attachment-id="1192" data-permalink="https://timdettmers.com/gpus_ada_performance_per_dollar6/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUs_Ada_performance_per_dollar6.png?fit=1703%2C1673&amp;ssl=1" data-orig-size="1703,1673" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="GPUs_Ada_performance_per_dollar6" data-image-description="" data-image-caption="&lt;p&gt;Shown is relative performance per US Dollar of GPUs normalized by the cost for a desktop computer and the average Amazon and eBay price for each GPU. Additionally, the electricity cost of ownership for 5 years is added with an electricity price of 0.175 USD per kWh and a 15% GPU utilization rate. The electricity cost for a RTX 4090 is about $100 per year. How to read and interpret the chart: a desktop computer with RTX 4070 Ti cards owned for 5 years yields about 2x more 8-bit inference performance per dollar compared to a RTX 3090 GPU.&lt;/p&gt;
" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUs_Ada_performance_per_dollar6.png?fit=300%2C295&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUs_Ada_performance_per_dollar6.png?fit=1024%2C1006&amp;ssl=1" width="1024" height="1006" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUs_Ada_performance_per_dollar6.png?resize=1024%2C1006&#038;ssl=1" alt="" class="wp-image-1192" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUs_Ada_performance_per_dollar6.png?resize=1024%2C1006&amp;ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUs_Ada_performance_per_dollar6.png?resize=300%2C295&amp;ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUs_Ada_performance_per_dollar6.png?resize=768%2C754&amp;ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUs_Ada_performance_per_dollar6.png?resize=1536%2C1509&amp;ssl=1 1536w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUs_Ada_performance_per_dollar6.png?w=1703&amp;ssl=1 1703w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" /></a><figcaption>Shown is relative performance per US Dollar of GPUs normalized by the cost for a desktop computer and the average Amazon and eBay price for each GPU. Additionally, the electricity cost of ownership for 5 years is added with an electricity price of 0.175 USD per kWh and a 15% GPU utilization rate. The electricity cost for a RTX 4090 is about $100 per year. How to read and interpret the chart: a desktop computer with RTX 4070 Ti cards owned for 5 years yields about 2x more 8-bit inference performance per dollar compared to a RTX 3090 GPU.</figcaption></figure>







<h2 class="eplus-RwDsYV">GPU Recommendations</h2>



<p>I have created a recommendation flow-chart that you can see below (click here for the <a href="https://nanx.me/gpu/">interactive app</a> from Nan Xiao). While this chart will help you in 80% of cases, it might not quite work for you because the options might be too expensive. In that case, try to look at the benchmarks above and pick the most cost-effective GPU that still has enough GPU memory for your use-case. You can estimate the GPU memory needed by running your problem on vast.ai or Lambda Cloud for a while, so you know what you need. Vast.ai or Lambda Cloud might also work well if you only need a GPU sporadically (every couple of days for a few hours) and you do not need to download and process large datasets to get started. However, cloud GPUs are usually not a good option if you use your GPU for many months with a high usage rate each day (12 hours a day). You can use the example in the &#8220;When is it better to use the cloud vs a dedicated GPU desktop/server?&#8221; section below to determine if cloud GPUs are good for you.</p>



<figure class="wp-block-image size-full"><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?ssl=1"><img data-attachment-id="1173" data-permalink="https://timdettmers.com/gpu_recommendations/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?fit=845%2C686&amp;ssl=1" data-orig-size="845,686" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="gpu_recommendations" data-image-description="" data-image-caption="&lt;p&gt;GPU recommendation chart for Ada/Hopper GPUs. Follow the answers to the Yes/No questions to find the GPU that is most suitable for you. While this chart works well in about 80% of cases, you might end up with a GPU that is too expensive. Use the cost/performance charts above to make a selection instead.&lt;/p&gt;
" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?fit=300%2C244&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?fit=845%2C686&amp;ssl=1" width="845" height="686" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?resize=845%2C686&#038;ssl=1" alt="" class="wp-image-1173" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?w=845&amp;ssl=1 845w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?resize=300%2C244&amp;ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?resize=768%2C623&amp;ssl=1 768w" sizes="(max-width: 845px) 100vw, 845px" data-recalc-dims="1" /></a><figcaption>GPU recommendation chart for Ada/Hopper GPUs. Follow the answers to the Yes/No questions to find the GPU that is most suitable for you. While this chart works well in about 80% of cases, you might end up with a GPU that is too expensive. Use the cost/performance charts above to make a selection instead. [<a href="https://nanx.me/gpu/">interactive app</a>]</figcaption></figure>






<h3 class="eplus-pb5gym">Is it better to wait for future GPUs for an upgrade? The future of GPUs.</h3>



<p class="eplus-CD5XWA">To decide whether it makes sense to skip this generation and buy the next generation of GPUs, it helps to talk a bit about what future improvements will look like.</p>



<p class="eplus-CD5XWA">In the past, it was possible to shrink the size of transistors to improve the speed of a processor. This is coming to an end now. For example, while shrinking SRAM used to increase its speed (smaller distance, faster memory access), this is no longer the case. Current improvements in SRAM do not improve its performance anymore and might even be negative. While logic such as Tensor Cores gets smaller, this does not necessarily make GPUs faster, since the main problem for matrix multiplication is getting data to the tensor cores, which is dictated by SRAM and GPU RAM speed and size. GPU RAM still increases in speed if we stack memory modules into high-bandwidth modules (HBM3+), but these are too expensive to manufacture for consumer applications. The main way to improve the raw speed of GPUs is to use more power and more cooling, as we have seen in the RTX 30s and 40s series. But this cannot go on for much longer.</p>



<p>Chiplets such as used by AMD CPUs are another straightforward way forward. AMD beat Intel by developing CPU chiplets. Chiplets are small chips that are fused together with a high speed on-chip network. You can think about them as two GPUs that are so physically close together that you can almost consider them a single big GPU. They are cheaper to manufacture, but more difficult to combine into one big chip. So you need know-how and fast connectivity between chiplets. AMD has a lot of experience with chiplet design. AMD&#8217;s next generation GPUs are going to be chiplet designs, while NVIDIA currently has no public plans for such designs. This may mean that the next generation of AMD GPUs might be better in terms of cost/performance compared to NVIDIA GPUs.</p>



<p>However, the main performance boost for GPUs is currently specialized logic. For example, the asynchronous copy hardware units of the Ampere generation (RTX 30 / A100 / RTX 40), or their extension, the Tensor Memory Accelerator (TMA), reduce the overhead of copying memory from slow global memory to fast shared memory (caches) through specialized hardware, so each thread can do more computation. The TMA also reduces overhead by performing automatic calculations of read/write indices, which is particularly important for 8-bit computation, where one has double the elements for the same amount of memory compared to 16-bit computation. So specialized hardware logic can accelerate matrix multiplication further.</p>



<p>Low-bit precision is another straightforward way forward for a couple of years. We will see widespread adoption of 8-bit inference and training in the next months. We will see widespread 4-bit inference in the next year. Currently, the technology for 4-bit training does not exist, but research looks promising, and I expect the first high-performance FP4 Large Language Model (LLM) with competitive predictive performance to be trained in 1-2 years&#8217; time.</p>



<p>Going to 2-bit precision for training currently looks pretty impossible, but it is a much easier problem than shrinking transistors further. So progress in hardware mostly depends on software and algorithms that make it possible to use specialized features offered by the hardware.</p>



<p>We will probably still be able to improve the combination of algorithms + hardware until the year 2032, but after that we will hit the end of GPU improvements (similar to smartphones). The wave of performance improvements after 2032 will come from better networking algorithms and mass hardware. It is uncertain if consumer GPUs will be relevant at that point. It might be that you need an RTX 9090 to run Super HyperStableDiffusion Ultra Plus 9000 Extra or OpenChatGPT 5.0, but it might also be that some company will offer a high-quality API that is cheaper than the electricity cost of an RTX 9090, and you will want to use a laptop + API for image generation and other tasks.</p>



<p>Overall, I think investing in an 8-bit capable GPU will be a very solid investment for the next 9 years. Improvements at 4-bit and 2-bit are likely small, and other features like Sort Cores would only become relevant once sparse matrix multiplication can be leveraged well. We will probably see some kind of other advancement in 2-3 years which will make it into the next GPU 4 years from now, but we are running out of steam if we keep relying on matrix multiplication. This makes investments into new GPUs last longer.</p>



<h2 class="eplus-rZBwp1">Question &amp; Answers &amp; Misconceptions</h2>



<h3 class="eplus-BKkk3i">Do I need PCIe 4.0 or PCIe 5.0?</h3>



<p class="eplus-9ByauH">Generally, no. PCIe 5.0 or 4.0 is great if you have a GPU cluster. It is okay if you have an 8x GPU machine, but otherwise, it does not yield many benefits. It allows better parallelization and a bit faster data transfer. Data transfer over PCIe is rarely a bottleneck. In computer vision, data storage can be a bottleneck in the data pipeline, but not the PCIe transfer from CPU to GPU. So there is no real reason to get a PCIe 5.0 or 4.0 setup for most people. The benefit will be maybe 1-7% better parallelization in a 4-GPU setup.</p>



<h3 class="eplus-iAnpcZ">Do I need 8x/16x PCIe lanes?&nbsp;</h3>



<p class="eplus-4e3Zwd">Same as with PCIe 4.0 — generally, no. <a href="https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/">PCIe lanes</a> are needed for parallelization and fast data transfers, which are seldom a bottleneck. Operating GPUs on 4x lanes is fine, especially if you only have 2 GPUs. For a 4 GPU setup, I would prefer 8x lanes per GPU, but running them at 4x lanes will probably only decrease performance by around 5-10% if you parallelize across all 4 GPUs.</p>



<h3 class="eplus-5dLLUU">How do I fit 4x RTX 4090 or 3090 if they take up 3 PCIe slots each?</h3>



<p class="eplus-ZiOIiH">You need to get one of the two-slot variants, or you can try to spread them out with PCIe extenders. Besides space, you should also immediately think about cooling and a suitable PSU.</p>



<p class="eplus-hKh4h6">PCIe extenders might also solve both space and cooling issues, but you need to make sure that you have enough space in your case to spread out the GPUs. Make sure your PCIe extenders are long enough!</p>



<h3 class="eplus-2ceRV0">How do I cool 4x RTX 3090 or 4x RTX 3080?</h3>



<p class="eplus-Fchmll">See the previous section.</p>



<h3 class="eplus-hlYdEC">Can I use multiple GPUs of different GPU types?</h3>



<p class="eplus-cOwKhd">Yes, you can! But you cannot parallelize efficiently across GPUs of different types, since you will often go at the speed of the slowest GPU (data and fully sharded parallelism). So different GPUs work just fine, but parallelization across those GPUs will be inefficient, since the fastest GPU will wait for the slowest GPU to catch up to a synchronization point (usually the gradient update).</p>



<h3 class="eplus-YbZb25">What is NVLink, and is it useful?</h3>



<p class="eplus-kZZzbt">Generally, NVLink is not useful. NVLink is a high-speed interconnect between GPUs. It is useful if you have a GPU cluster with 128 GPUs or more. Otherwise, it yields almost no benefits over standard PCIe transfers.</p>



<h3 class="eplus-pqooRC">I do not have enough money, even for the cheapest GPUs you recommend. What can I do?</h3>



<p class="eplus-ClLhtz">Definitely buy used GPUs. You can buy a small, cheap GPU for prototyping and testing and then roll out full experiments to a cloud like vast.ai or Lambda Cloud. This can be cheap if you train/fine-tune/run inference on large models only every now and then and spend more time prototyping on smaller models.</p>



<h3 class="eplus-OqapSw">What is the carbon footprint of GPUs? How can I use GPUs without polluting the environment?</h3>



<p class="eplus-zai7Ps">I built a <a href="https://github.com/TimDettmers/carbonneutral">carbon calculator</a> for calculating the carbon footprint of academics (carbon from flights to conferences + GPU time). The calculator can also be used to calculate a pure GPU carbon footprint. You will find that GPUs produce much, much more carbon than international flights. As such, you should make sure you have a green source of energy if you do not want to have an astronomical carbon footprint. If no electricity provider in your area provides green energy, the best way is to buy carbon offsets. Many people are skeptical about carbon offsets. Do they work? Are they scams?</p>
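<p>As a rough sketch of such a calculation: the power draw and utilization below match the cost example later in this post, but the grid intensity of 0.4 kg CO2 per kWh is an assumed average, not a number from the calculator linked above.</p>

```python
# Rough annual CO2 estimate for a single-GPU desktop.
# Power draw and utilization match the cost example in this post;
# the 0.4 kg CO2/kWh grid intensity is an assumed average.
gpu_watts, cpu_watts = 350, 100
utilization = 0.15                  # fraction of the day the machine runs
hours_per_year = 24 * 365

kwh_per_year = (gpu_watts + cpu_watts) / 1000 * utilization * hours_per_year
co2_kg_per_year = kwh_per_year * 0.4   # assumed grid intensity

print(f"{kwh_per_year:.0f} kWh/year, about {co2_kg_per_year:.0f} kg CO2/year")
```

<p>At higher utilization rates or with multiple GPUs, this number scales linearly, which is why heavily used servers dominate a research carbon footprint so quickly.</p>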



<p class="eplus-EWsGx0">I believe skepticism just hurts in this case, because not doing anything would be more harmful than risking the probability of getting scammed. If you worry about scams, just invest in a portfolio of offsets to minimize risk.</p>



<p class="eplus-69USNG">I worked on a project that produced carbon offsets about ten years ago. The carbon offsets were generated by burning leaking methane from mines in China. UN officials tracked the process, and they required clean digital data and physical inspections of the project site. In that case, the carbon offsets that were produced were highly reliable. I believe many other projects have similar quality standards.</p>



<h3 class="eplus-3vCLFD">What do I need to parallelize across two machines?</h3>



<p class="eplus-fTOuun">If you want to be on the safe side, you should get network cards with at least 50 Gbit/s of bandwidth to gain speedups when parallelizing across machines. I recommend having at least an EDR InfiniBand setup, meaning a network card with at least 50 Gbit/s of bandwidth. Two EDR cards with cable are about $500 on eBay.</p>



<p class="eplus-9X2IIc">In some cases, you might be able to get away with 10 Gbit/s Ethernet, but this is usually only the case for special networks (certain convolutional networks) or if you use certain algorithms (Microsoft DeepSpeed).</p>



<h3 class="eplus-PgjlVt">Are the sparse matrix multiplication features suitable for sparse matrices in general?</h3>



<p class="eplus-R0OltN">It does not seem so. Since the granularity of the sparse matrix needs to be 2 zero-valued elements out of every 4 elements, the sparse matrices need to be quite structured. It might be possible to adjust the algorithm slightly, by pooling 4 values into a compressed representation of 2 values, but this also means that precise arbitrary sparse matrix multiplication is not possible with Ampere GPUs.</p>
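<p>A minimal sketch of that 2:4 granularity constraint, in pure Python, just to illustrate the pattern the hardware expects:</p>

```python
def is_2_4_sparse(row):
    """True if every group of 4 consecutive values contains
    at least 2 zeros, i.e. the row fits the 2:4 sparsity pattern."""
    assert len(row) % 4 == 0, "row length must be a multiple of 4"
    return all(
        row[i:i + 4].count(0) >= 2
        for i in range(0, len(row), 4)
    )

# A row with 2 zeros in each group of 4 qualifies ...
assert is_2_4_sparse([1, 0, 0, 2, 0, 3, 0, 4])
# ... but an unstructured row with the same overall sparsity may not:
# here the first group of 4 has only one zero.
assert not is_2_4_sparse([1, 2, 3, 0, 0, 0, 0, 4])
```

<p>The second example is why "quite structured" matters: 50% sparsity overall is not enough; the zeros must be distributed so every group of 4 contains at least 2 of them.</p>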



<h3 class="eplus-MYBgIz">Do I need an Intel CPU to power a multi-GPU setup?</h3>



<p class="eplus-ILUHTD">I do not recommend Intel CPUs unless you heavily use CPUs in Kaggle competitions (heavy linear algebra on the CPU). Even for Kaggle competitions, AMD CPUs are still great, though. AMD CPUs are cheaper and generally better than Intel CPUs for deep learning. For a 4x GPU build, my go-to CPU would be a Threadripper. We built dozens of systems at our university with Threadrippers, and they all work great — no complaints yet. For 8x GPU systems, I would usually go with CPUs that your vendor has experience with. CPU and PCIe/system reliability is more important in 8x systems than straight performance or straight cost-effectiveness.</p>



<h3 class="eplus-V6H8gS">Does computer case design matter for cooling?</h3>



<p class="eplus-rSzCnL">No. GPUs are usually perfectly cooled if there is at least a small gap between them. Case design will give you 1-3&#176;C better temperatures; space between GPUs will give you 10-30&#176;C improvements. The bottom line: if you have space between GPUs, cooling does not matter. If you have no space between GPUs, you need the right cooler design (blower fan) or another solution (water cooling, PCIe extenders), but in either case, case design and case fans do not matter.</p>



<h3 class="eplus-Umve55">Will AMD GPUs + ROCm ever catch up with NVIDIA GPUs + CUDA?</h3>



<p class="eplus-sOjnl6">Not in the next 1-2 years. It is a three-way problem: Tensor Cores, software, and community.&nbsp;</p>



<p class="eplus-NfDW5K">AMD GPUs are great in terms of pure silicon: great FP16 performance, great memory bandwidth. However, their lack of Tensor Cores or an equivalent makes their deep learning performance poor compared to NVIDIA GPUs. Packed low-precision math does not cut it. Without this hardware feature, AMD GPUs will never be competitive. Rumors suggested that <a href="https://wccftech.com/amd-cdna-architecture-radeon-instinct-arcturus-gpu-120-cu-7680-cores/">some data center card</a> with a Tensor Core equivalent was planned for 2020, but no new data has emerged since then. Having only data center cards with a Tensor Core equivalent would also mean that few could afford such AMD GPUs, which would give NVIDIA a competitive advantage.</p>



<p class="eplus-O3Dvhx">Let’s say AMD introduces a Tensor-Core-like hardware feature in the future. Then many people would say, “But there is no software that works for AMD GPUs! How am I supposed to use them?” This is mostly a misconception. The AMD software via ROCm has come a long way, and support via PyTorch is excellent. While I have not seen many experience reports for AMD GPUs + PyTorch, all the software features are integrated. It seems that if you pick any network, you will be just fine running it on AMD GPUs. So here AMD has come a long way, and this issue is more or less solved.</p>



<p class="eplus-OTgr4r">However, even if the software and the lack of Tensor Cores were solved, AMD would still have a problem: the lack of community. If you have a problem with NVIDIA GPUs, you can Google the problem and find a solution. That builds a lot of trust in NVIDIA GPUs. You have the infrastructure that makes using NVIDIA GPUs easy (any deep learning framework works, any scientific problem is well supported). You have the hacks and tricks that make usage of NVIDIA GPUs a breeze (e.g., apex). You can find experts on NVIDIA GPUs and programming around every other corner, while I know far fewer AMD GPU experts.</p>



<p class="eplus-9AWxYr">In the community aspect, AMD is a bit like Julia vs Python. Julia has a lot of potential, and many would say, and rightly so, that it is the superior programming language for scientific computing. Yet, Julia is barely used compared to Python. This is because the Python community is very strong. Numpy, SciPy, Pandas are powerful software packages that a large number of people congregate around. This is very similar to the NVIDIA vs AMD issue.</p>



<p class="eplus-o6eRO8">Thus, it is likely that AMD will not catch up until a Tensor Core equivalent is introduced (1/2 to 1 year?) and a strong community is built around ROCm (2 years?). AMD will always snatch a part of the market share in specific subgroups (e.g., cryptocurrency mining, data centers). Still, in deep learning, NVIDIA will likely keep its monopoly for at least a couple more years.</p>



<h3 class="eplus-OkbBy3">When is it better to use the cloud vs a dedicated GPU desktop/server?</h3>



<p class="eplus-eFDtSu">Rule-of-thumb: If you expect to do deep learning for longer than a year, it is cheaper to get a desktop GPU. Otherwise, cloud instances are preferable unless you have extensive cloud computing skills and want the benefits of scaling the number of GPUs up and down at will.</p>



<p>The numbers in the following paragraphs are going to change over time, but the example serves as a scenario that helps you understand the rough costs. You can use similar math to determine if cloud GPUs are the best solution for you.</p>



<p class="eplus-i5Jvrp">The exact point in time when a cloud GPU becomes more expensive than a desktop depends highly on the service that you are using, and it is best to do a little math on this yourself. Below I do an example calculation for an AWS on-demand instance with 1x V100 and compare it to the price of a desktop with a single RTX 3090 (similar performance). The desktop with an RTX 3090 costs $2,200 (<a href="https://pcpartpicker.com/user/tim_dettmers/saved/#view=mZ2rD3">2-GPU barebone</a> + RTX 3090). Additionally, assuming you are in the US, there is an additional $0.12 per kWh for electricity. This compares to $2.14 per hour for the AWS on-demand instance.</p>



<p class="eplus-sF5f8X">At 15% utilization per year, the desktop uses:&nbsp;</p>



<p class="eplus-jfhFj2">(350 W (GPU) + 100 W (CPU))*0.15 (utilization) * 24 hours * 365 days = 591 kWh per year</p>



<p class="eplus-uZPGyz">So 591 kWh of electricity per year, that is an additional $71.</p>



<p class="eplus-EqZNc1">The break-even point for a desktop vs a cloud instance at 15% utilization (you use the cloud instance 15% of the time each day) would be about 300 days ($2,311 vs $2,270):</p>



<p class="eplus-WO1NL4">$2.14/h * 0.15 (utilization) * 24 hours * 300 days = $2,311</p>



<p class="eplus-3brSoQ">So if you expect to run deep learning models for longer than 300 days, it is better to buy a desktop instead of using AWS on-demand instances.</p>



<p class="eplus-CmY2qr">You can do similar calculations for any cloud service to make the decision if you go for a cloud service or a desktop.</p>
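<p>The arithmetic above can be condensed into a small script you can adapt with current prices. The dollar figures below are the example numbers from this section and will drift over time:</p>

```python
# Break-even point between a desktop and an on-demand cloud instance,
# using the example numbers from the text above.
desktop_cost = 2200.0            # USD: barebone + RTX 3090
electricity_usd_per_kwh = 0.12   # US price in the example
cloud_usd_per_hour = 2.14        # AWS V100 on-demand in the example
utilization = 0.15               # fraction of each day the GPU is busy
watts = 350 + 100                # GPU + CPU power draw

def desktop_total(days):
    """Purchase price plus electricity for the hours actually used."""
    hours = utilization * 24 * days
    return desktop_cost + watts / 1000 * hours * electricity_usd_per_kwh

def cloud_total(days):
    """Hourly rate for the same number of used hours."""
    return cloud_usd_per_hour * utilization * 24 * days

# First day on which the cloud has cost more than the desktop.
break_even_days = next(d for d in range(1, 3651)
                       if cloud_total(d) > desktop_total(d))
```

<p>With these inputs the break-even lands just under 300 days, in line with the figure above; swap in your own utilization and prices to redo the comparison for any cloud service.</p>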



<p class="eplus-6m3gGS">Common utilization rates are the following:</p>



<ul class="eplus-sNwMkX"><li>PhD student personal desktop: &lt; 15%</li><li>PhD student slurm GPU cluster: &gt; 35%</li><li>Company-wide slurm research cluster: &gt; 60%</li></ul>



<p class="eplus-deH52E">In general, utilization rates are lower for professions where thinking about cutting-edge ideas is more important than developing practical products. Some areas have low utilization rates (interpretability research), while others have much higher rates (machine translation, language modeling). In general, the utilization of personal machines is almost always overestimated. Commonly, most personal systems have a utilization rate of 5-10%. This is why I would highly recommend slurm GPU clusters for research groups and companies instead of individual desktop GPU machines.</p>



<h2 class="eplus-bXPSts">Version History</h2>



<ul class="eplus-xnXOsk"><li>2023-01-30: Improved font and recommendation chart. Added 5 years cost of ownership electricity perf/USD chart. Updated Async copy and TMA functionality. Slight update to FP8 training. General improvements.</li><li>2023-01-16: Added Hopper and Ada GPUs. Added GPU recommendation chart. Added information about the TMA unit and L2 cache.</li><li>2020-09-20: Added discussion of using power limiting to run 4x RTX 3090 systems. Added older GPUs to the performance and cost/performance charts. Added figures for sparse matrix multiplication.</li><li>2020-09-07: Added NVIDIA Ampere series GPUs. Included lots of good-to-know GPU details.</li><li>2019-04-03: Added RTX Titan and GTX 1660 Ti. Updated TPU section. Added startup hardware discussion.</li><li>2018-11-26: Added discussion of overheating issues of RTX cards.</li><li>2018-11-05: Added RTX 2070 and updated recommendations. Updated charts with hard performance data. Updated TPU section.</li><li>2018-08-21: Added RTX 2080 and RTX 2080 Ti; reworked performance analysis</li><li>2017-04-09: Added cost-efficiency analysis; updated recommendation with NVIDIA Titan Xp</li><li>2017-03-19: Cleaned up blog post; added GTX 1080 Ti</li><li>2016-07-23: Added Titan X Pascal and GTX 1060; updated recommendations</li><li>2016-06-25: Reworked multi-GPU section; removed simple neural network memory section as no longer relevant; expanded convolutional memory section; truncated AWS section due to not being efficient anymore; added my opinion about the Xeon Phi; added updates for the GTX 1000 series</li><li>2015-08-20: Added section for AWS GPU instances; added GTX 980 Ti to the comparison relation</li><li>2015-04-22: GTX 580 no longer recommended; added performance relationships between cards</li><li>2015-03-16: Updated GPU recommendations: GTX 970 and GTX 580</li><li>2015-02-23: Updated GPU recommendations and memory calculations</li><li>2014-09-28: Added emphasis for memory requirement of CNNs</li></ul>



<h2 class="eplus-qkgeNA">Acknowledgments</h2>



<p>I thank Suhail for making me aware of outdated prices on H100 GPUs, Gjorgji Kjosev for pointing out font issues, Anonymous for pointing out that the TMA unit does not exist on Ada GPUs, Scott Gray for pointing out that FP8 tensor cores have no transposed matrix multiplication, and reddit and HackerNews users for pointing out many other improvements.</p>



<p class="eplus-kKOsc0">For past updates of this blog post, I want to thank Mat Kelcey for helping me to debug and test custom code for the GTX 970; I want to thank Sander Dieleman for making me aware of the shortcomings of my GPU memory advice for convolutional nets; I want to thank Hannes Bretschneider for pointing out software dependency problems for the GTX 580; and I want to thank Oliver Griesel for pointing out notebook solutions for AWS instances. I want to thank Brad Nemire for providing me with an RTX Titan for benchmarking purposes. I want to thank Agrin Hilmkil, Ari Holtzman, Gabriel Ilharco, Nam Pho for their excellent feedback on the previous version of this blog post.</p>
</div></div>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/">Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/feed/</wfw:commentRss>
			<slash:comments>1665</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">6</post-id>	</item>
		<item>
		<title>LLM.int8() and Emergent Features</title>
		<link>https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/</link>
					<comments>https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/#comments</comments>
		
		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Wed, 17 Aug 2022 12:51:19 +0000</pubDate>
				<category><![CDATA[Deep Learning]]></category>
		<category><![CDATA[emergent features]]></category>
		<category><![CDATA[LLM.int8()]]></category>
		<guid isPermaLink="false">https://timdettmers.com/?p=1093</guid>

					<description><![CDATA[<p>When I attended NAACL, I wanted to do a little test. I had two pitches for my LLM.int8() paper. One pitch is about how I use advanced quantization methods to achieve no performance degradation transformer inference at scale that makes large models more accessible. The other pitch talks about emergent outliers in transformers and how [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/">LLM.int8() and Emergent Features</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="eplus-3CSL7i">When I attended NAACL, I wanted to do a little test. I had two pitches for my <a href="https://arxiv.org/abs/2208.07339">LLM.int8() paper.</a> One pitch is about how I use advanced quantization methods to achieve transformer inference at scale with no performance degradation, which makes large models more accessible. The other pitch talks about emergent outliers in transformers and how they radically change what transformers learn and how they function.</p>



<p class="eplus-W9Xcci">From that, I learned that quantization research is like printers. Nobody cares about printers. Nobody likes printers. But everybody is happy if printers do their job. </p>



<span id="more-1093"></span>



<p class="eplus-DFPRLL">How that job is done for you is described in another <a href="https://huggingface.co/blog/hf-bitsandbytes-integration">blog post</a> by my colleague and collaborator <a href="https://younesbelkada.github.io/">Younes Belkada</a>: through the bitsandbytes library with Hugging Face integration, you can easily run OPT-175B and BLOOM-176B on a single machine.</p>



<p class="eplus-UuxR1O">This blog post will spill some mandatory details about quantization, but I want to mostly make it about the emergent features that I found in transformers at scale. While the claims in the paper are highly robust, this blog post is a more speculative version of the paper that teases out the super curious details about the fascinating properties surrounding the emergent outlier features I found. I cannot spill all the details because my next project will delve deep into understanding these outlier features, but the space is so rich that I am happy to give you many curious details.</p>





<h2 class="eplus-Ng4f4T">Mandatory quantization details</h2>



<p class="eplus-dmc53I">In a previous version of this blog post, I jokingly had a section with the big title “All You Ever Wanted to Know about Quantization.” The section read: “If you quantize from 16-bit to 8-bit, you lose precision, which might degrade model prediction quality.”</p>



<p class="eplus-VdCm0n">That is it.</p>



<p class="eplus-QmPB8B">Most people do not want to learn more about quantization — and honestly, the small sentence above is already enough information. The details are very gritty and complicated, but it is all in the code. The math and concepts are very simple and straightforward — if you have worked on quantization before. If you have not encountered quantization, it is likely a hot devilish nightmare that will eat your liver.&nbsp;</p>



<p class="eplus-9l7CrQ">For those who say, “Pfff! Why do I need a liver anyways?”: well, here you go. For others, just move ahead and read about the mysteries of emergent features.</p>



<h3 class="eplus-6ppO01">What is quantization?</h3>



<p class="eplus-pcdGpk">Let us say you have a data type I5 with values [0, 1, 2, 3, 4, 5] and a data type I3 with values [0, 2, 4]. How do you quantize from data type I5 to I3? You follow a two-step procedure:</p>



<ol class="eplus-GsdfGv"><li>Normalize the range of I5 into I3.</li><li>Round to the nearest value of I3.</li></ol>



<p class="eplus-tefJ7X">Let&#8217;s do an example. Let&#8217;s say we have the vector [3, 1, 2, 3] in I5, and we want to quantize to I3.</p>



<p class="eplus-aorZsE">Here is the step-by-step recipe for quantization:</p>



<ol class="eplus-R2sRDM"><li>We find the absolute maximum value of the vector: [3, 1, 2, 3] -&gt; 3</li><li>Then we divide by that value: [3, 1, 2, 3] -&gt; [1.0, 0.33, 0.66, 1.0]</li><li>And now we multiply by the range of the target data type I3, which is 4: [1.0, 0.33, 0.66, 1.0] -&gt; [4.0, 1.33, 2.66, 4.0]</li><li>Now we round to the nearest value of I3: [4.0, 1.33, 2.66, 4.0] -&gt; [4, 2, 2, 4]</li></ol>



<p class="eplus-KZcrJX">We now converted [3, 1, 2, 3] in I5 to [4, 2, 2, 4] in I3. To dequantize, we reverse this process.</p>



<ol class="eplus-gS9Jqh"><li>Divide by 4: [4, 2, 2, 4] -&gt; [1.0, 0.5, 0.5, 1.0]</li><li>Multiply by the absolute maximum: [1.0, 0.5, 0.5, 1.0] -&gt; [3.0, 1.5, 1.5, 3.0]</li><li>Now we round again: [3.0, 1.5, 1.5, 3.0] -&gt; [3, 2, 2, 3]</li></ol>



<p class="eplus-3diZxN">We see that our quantization and dequantization led to one error:<br>[3, 1, 2, 3] became [3, 2, 2, 3]<br>The second element changed from 1 to 2. This is a quantization error that loses information about how precisely a value is encoded. If we have such errors and propagate them through many layers of a neural network, they accumulate, and they may change the result of a prediction and degrade the prediction quality.</p>
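<p>The recipe above fits in a few lines of NumPy. This is a toy sketch of absmax quantization with the I5/I3 example from this section (note that nearest-value rounding maps 1.33 to 2, since 2 is the closest value in [0, 2, 4]):</p>

```python
import numpy as np

# Absmax quantization into a toy data type: scale into the target range,
# snap to the nearest representable value, then reverse to dequantize.
I3 = np.array([0, 2, 4])  # representable values of the toy I3 data type

def quantize(x, grid):
    absmax = np.abs(x).max()
    scaled = x / absmax * grid.max()                       # normalize into grid range
    nearest = np.abs(scaled[:, None] - grid).argmin(axis=1)  # nearest grid value
    return grid[nearest], absmax                           # quantized values + constant

def dequantize(q, absmax, grid):
    return np.round(q / grid.max() * absmax).astype(int)

q, c = quantize(np.array([3, 1, 2, 3]), I3)
print(q.tolist())                      # [4, 2, 2, 4]
print(dequantize(q, c, I3).tolist())   # [3, 2, 2, 3] -- one quantization error
```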



<h3 class="eplus-ugaKKG">How to make quantization methods more precise</h3>



<p class="eplus-gLCi7x">Quantization can be made more precise in two ways: use a better data type, or use more normalization constants (absolute maximum values).</p>



<p class="eplus-Cjg0qn">Regarding data types, Int8 is a terrible data type for deep learning. That is why I developed <a href="https://arxiv.org/abs/2110.02861">new data types</a> in my research. However, GPUs currently do not support data types other than Int8 at the hardware level, and as such, we are out of luck and need to use Int8.</p>



<p class="eplus-aOZcxy">The only remaining way to improve quantization is through more normalization constants. A normalization constant squishes the input distribution, for example, I5, into the target distribution, for example, I3. We can increase precision by squishing each vector only as much as is needed. For example, if you have the two vectors:</p>



<p class="eplus-fL9v3i">[3, 1, 2, 3]&nbsp;<br>[0, 2, 2, 0]</p>



<p class="eplus-4iFHKa">Then you can normalize the first by its absolute maximum of 3 and the second by 2. This gives you more precision when quantizing the second vector because its inputs are now spread over a broader range of the I3 data type. In fact, the second vector can be quantized without errors if you use an additional absolute maximum value. If you use only a single constant over both vectors (tensor-wise constants), you will get additional quantization errors.</p>



<h3 class="eplus-mEvg3p">Vector-wise quantization</h3>



<p class="eplus-FvNMmE">So now that we know how to make quantization more precise, how do we achieve maximum precision for matrix multiplication?</p>



<p class="eplus-bi1bOM">The key is this: If we use different normalization constants for dependent vectors, we then need to recover this information in the dequantization step. For example, if we subtract a constant to center one distribution over another: (A-minA)(B-minB) then to dequantize the output in A*B=C we need to do:</p>



<p class="eplus-8c1xuZ">A*B = C</p>



<p class="eplus-0T0W7B">(A &#8211; minA)(B &#8211; minB) = A*B &#8211; A*minB &#8211; B*minA + minA*minB = C &#8211; A*minB &#8211; B*minA + minA*minB</p>



<p class="eplus-fOM75R">As such, dependent quantization requires additional computation, in this case, a couple of matrix-vector multiplications and additions, which can be expensive if A and B are matrices.</p>



<p class="eplus-XGpnrc">As such, we look for as many normalization constants as we can get that are still independent. What does this look like?</p>



<p class="eplus-moVvwz">We can see a matrix multiplication as a sequence of independent inner products between row vectors of A and column vectors of B. We can have a separate constant for each of these vectors. Denormalization happens by multiplying these two constants together for a particular element. No other computation is needed. This is vector-wise quantization. More details are in the <a href="https://arxiv.org/abs/2208.07339">paper</a>.</p>
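<p>Here is a sketch of vector-wise quantization for C = A*B in NumPy. This simulates the Int8 arithmetic with ordinary integer matrices rather than real GPU kernels, and assumes a symmetric Int8 range of &#177;127:</p>

```python
import numpy as np

def vectorwise_matmul(A, B):
    # One absmax constant per row of A and per column of B.
    ca = np.abs(A).max(axis=1, keepdims=True)
    cb = np.abs(B).max(axis=0, keepdims=True)
    # Quantize each vector into the symmetric Int8 range [-127, 127].
    qa = np.round(A / ca * 127).astype(np.int32)
    qb = np.round(B / cb * 127).astype(np.int32)
    # Integer matmul, then denormalize: the constant for output element (i, j)
    # is just ca[i] * cb[j] -- an outer product, no extra corrections needed.
    return (qa @ qb) * (ca * cb) / (127 * 127)

rng = np.random.default_rng(0)
A, B = rng.normal(size=(4, 8)), rng.normal(size=(8, 4))
max_err = np.abs(vectorwise_matmul(A, B) - A @ B).max()
print(max_err)  # small quantization error relative to the float result
```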



<h3 class="eplus-HoJ53w">Mixed precision decomposition</h3>



<p class="eplus-MoRO01">Before we come to the emergent magnitude features, let me explain the last part of our method that is absolutely critical to achieving zero-degradation quantization at scales of up to 175B parameters.</p>



<p class="eplus-aXtLwz">So it turns out that transformers have these emergent features with very large values. They occur in particular hidden dimensions and are active in up to 75% of all sequence dimensions. They occur in all layers (well, most layers, but we will come to that). So if you have a transformer hidden state X of dimensionality [batch, sequence, hidden], then X[:, :, i] for some i has values that look like this:</p>



<p class="eplus-E1g6wT">[-60, -45, -51, -35, -20, -67]</p>



<p class="eplus-nuJruQ">Whereas 99.9% of dimensions look like this (normally distributed with one outlier)</p>



<p class="eplus-Q3dQ1x">[-0.10, -0.23,&nbsp; 0.08, -0.38, -0.28, -0.29, -2.11,&nbsp; 0.34, -0.53, -67.0]</p>



<p class="eplus-nbZzlJ">If we quantize and dequantize a row without an outlier, we get this:</p>



<p class="eplus-JJx9Bc">[-0.10, -0.23, 0.08, -0.38, -0.28, -0.28, -2.11, 0.33, -0.53]</p>



<p class="eplus-BR9Pcp">with only tiny errors at the 0.01 precision level, for example, -0.28 instead of -0.29. However, if we quantize the same vector with the outlier, we get this:</p>



<p class="eplus-n2kDRO">[-0.00, -0.00, 0.00, -0.53, -0.53, -0.53, -2.11, 0.53, -0.53, -67.00]</p>



<p class="eplus-NIWxqf">In other words, even if we use vector-wise quantization, we squish a lot of information to zero and have large errors. On average, vectors without outliers have a mean error of 0.015. This vector has an error of 0.12. Do this for a couple of layers, and we remove all information and end up with pure noise.</p>



<p class="eplus-1fgfP3">The problem is that at a scale of 6.7B parameters and above, 75% of hidden state sequences are affected. So this absolutely wrecks quantization.</p>



<p class="eplus-mAjJ8A">The good news is that these outliers are highly systematic. While you have 150,000 outliers per sequence in a 6.7B transformer, they only occur in 6 feature dimensions (6 different indices “i” as in X[:, :, i]).</p>



<p class="eplus-9OcMD8">As such, we can separate these emergent features into a separate, high-precision matrix multiplication, quantize the other 99.9% of values to Int8, and combine the outputs of both matrix multiplications. This avoids the information-squishing-to-zero effect, and we can recover full transformer performance.</p>
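<p>Putting the two ideas together, here is a hedged NumPy sketch of this decomposition. The 6.0 outlier threshold follows the paper, but the Int8 path below is simulated with ordinary integer matrices, not the actual CUDA kernels in bitsandbytes:</p>

```python
import numpy as np

def int8_vectorwise(A, B):
    # Simulated vector-wise Int8 matmul: one absmax constant per row of A
    # and per column of B, dequantized by their outer product.
    ca = np.abs(A).max(axis=1, keepdims=True) + 1e-8
    cb = np.abs(B).max(axis=0, keepdims=True) + 1e-8
    qa = np.round(A / ca * 127).astype(np.int32)
    qb = np.round(B / cb * 127).astype(np.int32)
    return (qa @ qb) * (ca * cb) / (127 * 127)

def mixed_precision_matmul(X, W, threshold=6.0):
    # Columns of X that hold emergent outliers go through a high-precision
    # matmul; the remaining 99.9% of values go through the Int8 path.
    outliers = np.abs(X).max(axis=0) > threshold
    hi = X[:, outliers] @ W[outliers, :]
    lo = int8_vectorwise(X[:, ~outliers], W[~outliers, :])
    return hi + lo

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 2] = [-60, -45, -51, -35]   # one emergent outlier dimension
W = rng.normal(size=(8, 4))

err = np.abs(mixed_precision_matmul(X, W) - X @ W).max()
naive_err = np.abs(int8_vectorwise(X, W) - X @ W).max()
print(err, naive_err)  # the decomposition beats plain Int8 by a wide margin
```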






<figure class="wp-block-image size-large eplus-X45Ytq"><img data-attachment-id="1096" data-permalink="https://timdettmers.com/llm_int8/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2022/08/LLM_int8.gif?fit=600%2C300&amp;ssl=1" data-orig-size="600,300" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="LLM_int8" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2022/08/LLM_int8.gif?fit=300%2C150&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2022/08/LLM_int8.gif?fit=600%2C300&amp;ssl=1" width="600" height="300" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2022/08/LLM_int8.gif?resize=600%2C300&#038;ssl=1" alt="" class="wp-image-1096" data-recalc-dims="1"/></figure>



<h2 class="eplus-ZfQ3WS">Results</h2>



<p class="eplus-blVVmm">The results show that this method works well. We can recover full performance by using the LLM.int8() quantization procedure. You can clearly see that there is a big dip in performance for the 8-bit baseline, which is vector-wise quantization. We need both vector-wise quantization and mixed precision decomposition, that is, the full LLM.int8() method to recover full performance. Either of these methods alone is not sufficient.<img src="https://lh5.googleusercontent.com/TDrTlopijg4gvi2tTjPsMNLO23wcTJvhnYfc3WHszjtk5nsUgPQUxWOY5vyJysAomfhvbhgjhXP94sKT9v898vP53WW9ptb_itIpQ92xmkdfL7VHdY7cS1ldLpxh3parcz-lIdNgKL3NoxVXikqLfB0" width="624" height="504"></p>



<h2 class="eplus-nYJtTh">Emergent Features</h2>



<p class="eplus-rejxPU">There are a lot of exciting findings in the paper:&nbsp;</p>



<ol class="eplus-QUM2k0"><li>Emergence is not sudden but gradual and grows according to an exponential function related to perplexity and not model size.</li><li>Outlier features grow very quickly once their phase shift occurs.</li><li>The number of outlier features is strictly proportional to perplexity.</li></ol>



<p class="eplus-KJZ7Yd">Many other findings did not make it into the paper because these were too difficult to verify robustly, but I wanted to share them here anyway. Since these results are less robust, take them with a grain of salt.</p>



<p class="eplus-ZGP3s4">But I am jumping ahead: What is emergence, and what makes an emergent feature? If I put it in my own words, I would say:</p>



<p class="eplus-89vQsn"><em>Emergence is a gradual change in a property that suddenly undergoes a phase shift and then changes the quality of its substrate.</em></p>



<p class="eplus-XpgRap">Let’s think step-by-step.</p>



<p class="eplus-r9F3TD"><strong>Substrate</strong>: Transformer<br><strong>Property</strong>: Very large features in particular hidden dimensions across the transformer<br><strong>Gradual change</strong>: Decreasing perplexity, more and larger outlier features<br><strong>Phase shift</strong>: Outlier features suddenly become available in all transformer layers and coordinate through a few hidden dimensions.<br><strong>Change of quality:</strong> Highly sparse, almost discrete attention; very dense FFN layers; “dual attention”; long-range attention (?); stable training through increased numerical stability.</p>



<p class="eplus-lh0MYv">Some additional terms. What is a feature?</p>



<p class="eplus-tabQid">If you have hidden state X that is passed along a transformer with dimensionality [batch, sequence, hidden], then a feature is a particular dimension X[:, :, i], which offers some weak explanation for the label.</p>



<h3 class="eplus-RWqToH">Emergent Features in a Nutshell</h3>



<p class="eplus-noiMEj">To get a little sense of what this is all about, here is a short explanation encapsulating everything important about emergent features.&nbsp;</p>



<p class="eplus-LreO17">The most intuitive explanation of feature outliers is that transformers have two processing streams. One stream learns features that explain the inputs, and the other stream learns features that remove other features. Removing noisy, context-irrelevant features is the key to making accurate predictions. The more noisy, context-irrelevant features you remove in early layers, the less conflicting high-level features you have in later layers.&nbsp;</p>



<p class="eplus-A1bu0Z">For example, if you classify dogs vs. cats, it makes sense to “sharpen” the key features that make these animals different (e.g. cat eyes, cat ears) and remove the similar features (fur color and potentially texture). This is particularly relevant if you have many noisy “weak” features as in natural language processing.&nbsp;</p>



<p class="eplus-2y7oYu">If you take this mechanism to an extreme, you can get discretization, which goes hand-in-hand with context-dependent memory and “reasoning” over elements. Discretization means, you have, say, 100 features, but you decide to remove 99% of them by setting them to zero, and you amplify the rest. The result is a single feature that is now a discrete entity. Once discretized, this entity can be stored and reused later.</p>



<p class="eplus-PRBtGb">To coordinate these streams throughout the transformer, it is useful to dedicate certain hidden dimensions to the functionality of removing other features. That way, if the transformer needs to remove features, it knows beforehand which feature dimension to access to perform that functionality.</p>



<p class="eplus-En0aGT">How do you remove features? You have a single dimension with very large positive or negative values, and you multiply that dimension with a positive/negative number.</p>



<p class="eplus-wXtjS9">Take the following matrix, which is similar to how emergent features are represented in hidden states.&nbsp;</p>



<p class="eplus-QkHSlm">[0, 1, -60, 4]<br>[3, 0, -50, -2]<br>[-1, 0, -55, 1]<br>[3, 2, -60, 1]</p>



<p class="eplus-LIJ94J">If we want to remove, say, features (columns) 0 and 3 in a matrix multiplication followed by a non-linear function, all we have to do is to multiply everything by a negative number and multiply the outlier feature for columns 0 and 3 by a positive number. If we do this with negative and positive 1s, it looks like this:</p>



<p class="eplus-W0OQTf">[-1, -1, -1, -1]<br>[-1, -1, -1, -1]<br>[<strong> 1</strong>, -1,&nbsp; -1,&nbsp; <strong>1</strong>]<br>[-1, -1, -1, -1]</p>



<p class="eplus-hewdEE">We receive the following after a softmax:</p>



<p class="eplus-JrPnkX">[0, 0.5, 0.5, 0]<br>[0, 0.5, 0.5, 0]<br>[0, 0.5, 0.5, 0]<br>[0, 0.5, 0.5, 0]</p>



<p class="eplus-93BJW7">The neat thing about this system is that if you always maintain the outlier feature in dimension 3, you know beforehand where to insert a positive number to remove a feature (row 3 of the other matrix).</p>
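<p>You can check this toy example directly in NumPy. The matrices are the ones from above: weights are -1 everywhere except for +1s where the outlier row meets columns 0 and 3:</p>

```python
import numpy as np

# Hidden states with an outlier feature in the third column (the -60-ish values).
H = np.array([[ 0, 1, -60,  4],
              [ 3, 0, -50, -2],
              [-1, 0, -55,  1],
              [ 3, 2, -60,  1]], dtype=float)

# Weights: all -1s, with +1 where the outlier dimension meets features 0 and 3.
W = -np.ones((4, 4))
W[2, 0] = W[2, 3] = 1.0

logits = H @ W                 # columns 0 and 3 get huge negative logits
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(np.round(probs, 2))      # every row becomes [0, 0.5, 0.5, 0]
```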



<p class="eplus-gXg955">Transformers seem to coordinate these dimensions throughout all layers except the attention function and the second feedforward network where these outliers are “consumed” to remove features.</p>



<p class="eplus-IHtJRm">This means that transformers always use a certain dimension for these outliers, and each layer &#8220;knows&#8221; beforehand how to remove a feature because these feature dimensions always have very large values with a particular sign (some are negative, some are positive).</p>



<p class="eplus-jrpePQ">However, this full &#8220;coordination&#8221; through a single dimension only happens after the phase shift. Before the phase shift, in transformers with fewer than 6.7B parameters, some layers disagree on which dimension to use for these large features.</p>



<h3 class="eplus-2V1R69">How Emergent Features Emerge</h3>



<p class="eplus-10ZNUc">Emergent outlier features are present in even very small transformers (125M parameters), and they start out in the attention projection layers (key/query/value). Feature outliers are “consumed” in the attention function (softmax) and the second fully connected sublayer (contraction layer). The outlier features are likely consumed in these layers since the second feedforward network (FFN) sub-layer and the softmax have non-linear functions that can easily squash features to zero.</p>



<p class="eplus-AG4262">Once you scale transformers a bit more (350M to 1.3B), outliers also occur in the FFN and attention output layers. At this scale, some successive attention layers and FFN layers use the same dimension to coordinate what features to remove. This pairing has synergy: the attention layer is good at context-dependent selection and pattern matching, while the FFN layers are good at global, context-independent pattern matching.</p>



<p class="eplus-QLO9ja">At this scale, however, outliers are still probabilistic. This means they occur mostly in some dimensions, but these dimensions can change slightly from mini-batch to mini-batch and from layer to layer. At this scale, layers have not yet learned to coordinate outlier features through the same dimension. This makes it more difficult to remove unwanted features.</p>



<p class="eplus-dG3aQO">At the 2.7B to 6B scale, things become much more coordinated. Now 60% of layers agree on which outlier dimension to use.</p>



<p class="eplus-YKGjaY">The phase shift happens around 6.7B, where 100% of layers use the same dimension for outliers. At this point, a couple of things happen rapidly:</p>



<ol class="eplus-lio37T"><li>Outliers become very large quickly. They grow from about 15 for a 6B model to about 60 for a 13B model. OPT-66B has outliers of size around 95, which indicates this growth phase is temporary.</li><li>Attention layers become very sparse. The attention is very concentrated so that just a few sequence dimensions determine the top probability and the overall probability mass. Almost all sequence dimensions have zero probability. However, this is still context-dependent, and the transformer seems to be &#8220;unsure&#8221; what to attend to for some sequences.</li><li>FFN layers become more “dense”. While in computer vision, you can prune about 95% of weights without severe performance degradation, that number is 30% for transformers trained on NLP data. After emergence, this number shrinks to well below 5%. It seems that canceling out features can remove noise that is generated from the many weak features that are activated. Because these are silenced now, each set of neurons can learn many more features that are almost independent of each other due to the masking of context-dependent features.</li><li>Transformers become more stable. If you treat the outlier features separately, I believe you can probably run and even train transformers in less than 8-bit precision without degradation in performance.</li></ol>



<h2 class="eplus-oZEQDR">The Most Important Take-aways for Your Research</h2>



<p class="eplus-4zc88t">You may say, &#8220;This is all good and well, Tim, but what does this mean for me and my research?&#8221; Good question! I think it changes quite a bit.</p>



<h4 class="eplus-tTFqmG">There are two types of transformers and you should not generalize from one to the other.</h4>



<p class="eplus-uOoJXZ">From these findings, it is clear that transformers after the phase shift at 6.7B parameters behave very differently from transformers before the phase shift. As such, one should not try to generalize from &lt;6.7B transformers to transformers beyond 6.7B parameters. </p>



<p class="eplus-VaNyCJ">But training and using 6.7B transformers can be pretty painful. At Facebook AI Research, I had a 1.3B-parameter baseline, and I would usually run 2-3 of those models on 128 GPUs each for a total of 384 GPUs. Despite these massive resources, it would still feel &#8220;slow&#8221; in that my research progress was mostly hindered by compute. I imagine training 6.7B models on 8 or even 32 GPUs must be super painful. Is there a way that we can avoid this?</p>



<p class="eplus-5sKPkJ">I think another key finding from the paper can help. We found that the emergence of features occurs smoothly, following an exponential function of decreasing perplexity. As such, one could do the following.</p>



<p class="eplus-091Qkp">We train multiple smaller models, say, 125M, 350M, and 1.3B parameters, and then we measure the emergent property in those models and relate it to the property that we are interested in analyzing, for example, a new architecture or a new form of interpreting models. Once we have gathered this data, we can measure how a change in the emergent property changes the results of our new method. With that, we might be able to determine if our new method generalizes to models beyond 6.7B parameters.</p>



<p class="eplus-xHTtw8">While, by definition, the phase shift leads to a stark change in behavior, this method of extrapolating emergent behavior might yield more robust predictions for your research. It would be effortful and complicated to do this, but it is better than &#8220;wishful thinking&#8221; research that does not generalize.</p>



<h4 class="eplus-N03ebN">We might be able to find new emergent properties by studying &#8220;scaling laws of emergence&#8221;.</h4>



<p class="eplus-mg1iST">The finding that emergence can be measured in small models means that new emergent properties that require models larger than 175B parameters might be already measurable in the open-source OPT models.</p>



<p class="eplus-vPJp5e">If we can correlate statistics of a property with increasing capabilities, and if this property follows a function that will eventually &#8220;threshold&#8221;, we might have discovered a new emergent property that leads to new capabilities.</p>



<h2 class="eplus-Zh5AMx">Conclusion</h2>



<p class="eplus-HhWilC">In this blog post, I introduced LLM.int8() and gave an introduction to the emergent features that we discovered in language models at scale. I discussed the implications of these emergent features, in particular how they relate to generalization.</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/">LLM.int8() and Emergent Features</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/feed/</wfw:commentRss>
			<slash:comments>13</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1093</post-id>	</item>
		<item>
		<title>How to Choose Your Grad School</title>
		<link>https://timdettmers.com/2022/03/13/how-to-choose-your-grad-school/</link>
					<comments>https://timdettmers.com/2022/03/13/how-to-choose-your-grad-school/#comments</comments>
		
		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Sun, 13 Mar 2022 22:43:57 +0000</pubDate>
				<category><![CDATA[Academia]]></category>
		<category><![CDATA[PhD Life]]></category>
		<category><![CDATA[Advisors]]></category>
		<category><![CDATA[Grad school]]></category>
		<category><![CDATA[PhD]]></category>
		<guid isPermaLink="false">https://timdettmers.com/?p=805</guid>

					<description><![CDATA[<p>If you are reading this, then you probably finished the long and arduous journey to grad school. You emerged victoriously, and this success is well-deserved. But which school should you choose? How to make a right choice if all schools look great in their own way? This blog post is centered around these questions. It [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2022/03/13/how-to-choose-your-grad-school/">How to Choose Your Grad School</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>If you are reading this, then you probably finished the long and arduous journey to grad school. You emerged victoriously, and this success is well-deserved. But which school should you choose? How to make a right choice if all schools look great in their own way? This blog post is centered around these questions. It is most useful if you are a computer science student aiming to study machine learning and, in particular, natural language processing in the US, but most of the information here is equally valid for any field of research and any country.</p>
<p>The choice of grad school that is right for you can be tricky and confusing. We live in a time of hyper-competitiveness, where even undergrads need to optimize for metrics like paper count to make it to the next level — grad school. This heavily career-centered perspective was probably advantageous to get you into grad school, and it remains crucial to get you to the level after that: a great job in industry or academia. So choosing the school which is best for your career can feel like an obvious choice. However, a PhD is a very long journey, and choosing your grad school based on this perspective alone might make you more vulnerable to burn-out, disillusionment, and general dissatisfaction.</p>
<p>In this blog post, I will discuss this career-centered perspective in detail, but I also provide you with three other views that hopefully help you make a balanced choice that not only leads to academic success but long-term satisfaction and a full and rich life. Balancing your decision based on all four perspectives probably leads to a better choice than looking at one angle alone. Before I go into the details, let me briefly introduce these four perspectives: The Career Perspective, the Identity Perspective, the Stability Perspective, and the Variability Perspective.</p>
<p><span id="more-805"></span></p>
<p><span data-preserver-spaces="true">A quite intuitive perspective is the </span><strong><span data-preserver-spaces="true">Career Perspective</span></strong><span data-preserver-spaces="true">, which is about determining and weighing the factors that help you to be successful in your PhD and have a successful career.</span></p>
<p><span data-preserver-spaces="true">A different perspective is the </span><strong><span data-preserver-spaces="true">Identity Perspective</span></strong><span data-preserver-spaces="true">: not looking at your career but at who you want to be and how your choice enables and facilitates that identity. The social environment that you are in has a strong causal effect on your development: We are strongly influenced by the people and culture around us, and the friends of friends that you do not even know will make you honest/deceitful, selfish/selfless, caring/exploiting, and so forth. If you choose a school where the unwritten motto is &#8220;The worth of a person is measured in papers and citations&#8221; you will slowly but surely grow to be a person that would live by such a motto. Would you like to be such a person? So by choosing a school you in some way also define and constrain the person that you can become.</span></p>
<p><span data-preserver-spaces="true">The </span><strong><span data-preserver-spaces="true">Stability Perspective</span></strong><span data-preserver-spaces="true"> says that choosing the &#8220;right&#8221; school is an illusion but that there are other choices that matter much more because they give you the stability that you need to succeed in the arduous PhD journey. It is well known that the effect of most moderately painful or enjoyable events that significantly affect your life will wear off within about two years and that you will </span><a class="_e75a791d-denali-editor-page-rtfLink" href="https://en.wikipedia.org/wiki/Hedonic_treadmill#Happiness_set_point" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">return to your baseline happiness</span></a><span data-preserver-spaces="true"> and stay there. However, some things are more stable. A great and friendly social environment where you always feel supported and not alone will meet most of your human needs and will make a 5-year-or-so journey a breeze. On the other hand, a tiny research group with a distant advisor will make for an uncertain, lonely, and stressful 5 years.</span></p>
<p><span data-preserver-spaces="true">Another valid way to select a school is by the variability of experience it will offer — the </span><strong><span data-preserver-spaces="true">Variability Perspective</span></strong><span data-preserver-spaces="true">. You probably sacrificed in some way to get into grad school. You neglected your passions outside work, neglected friends or your partner or your family, neglected self-development, neglected to work on your mental, physical or spiritual health, or you neglected other things that are important to you. By choosing the school that is best for your career, you might very well continue on this path of neglect. When does it stop? Once you have completed an excellent PhD, you might labor on by choosing that super competitive assistant professor job, then tenure, then being a leading figure in your field, and so on. There is nothing wrong with such a path through life, but continuous exploitation will lead to local minima. The two most common regrets of the dying are &#8220;I wish I&#8217;d had the courage to live a life true to myself, not the life others expected of me&#8221; and &#8220;I wish I hadn&#8217;t worked so hard.&#8221; The dying probably would have avoided their situation if they had known better. Making sure you have the time and opportunity for further exploration is very helpful in gathering the information necessary to make better choices in the future that do not lead to regret.</span></p>
<p></p>
<h1><span data-preserver-spaces="true">The Career Perspective: Choosing Based on Expected Success</span></h1>
<p><span data-preserver-spaces="true">The career perspective looks at the most critical factors for your academic success and success beyond that and chooses the school that is best according to these factors. Let me go through each factor. I list the factors in order of importance, starting from the most important.</span></p>
<h2><span data-preserver-spaces="true">Advisor</span></h2>
<p><span data-preserver-spaces="true">Finding suitable advisors is probably the most crucial task when choosing between grad schools. One could even go further and argue that one should not choose a school, but one should choose an advisor. A lousy advisor can make you miserable, unproductive, and stressed, and might be the main reason why you would want to drop out of the program. The right advisor will help you to be productive, stay healthy, and enjoy doing your research. It is important to emphasize personal fit here: Some advisors are great for you and bad for others and vice versa. The following criteria will help you identify advisors that might be better for you than others. However, there is a great deal of gut feeling to this decision. It is a bit similar to dating: even if everything is right on paper, that doesn&#8217;t mean this is the right person for you.</span></p>
<p><span data-preserver-spaces="true">Another important note is that you should be looking for advisors and not a single advisor. This complicates an already complicated process, but it is risky to choose a school based on a single person. Relationships are complicated, and things might not work out as expected with your advisor. If possible, you should have an alternative advisor to whom you can switch if it does not work out with the other advisor. This strategy also offers the possibility of being co-advised — two advisors that complement each other may provide a great fit even though a single advisor might not.</span></p>
<p><span data-preserver-spaces="true">The following advisor-related factors do not have a particular order.</span></p>
<h3>Research Style</h3>
<p>Research style is probably the most elusive quality but also by far the most important quality that you acquire during your PhD. While many would say that the goal of a PhD is to become an independent researcher, the truth is that with the steep requirements for ML/NLP PhD positions, many students are already somewhat close to being independent. They can generate ideas with ease and execute them confidently in research projects. However, the actual quality that new students lack is research style.</p>
<p>Harriet Zuckerman is probably the person who studied scientific expertise to the greatest qualitative depth. In her work <a href="https://www.amazon.com/Scientific-Elite-Laureates-Foundations-Education/dp/1560008555">Scientific Elite</a>, she interviewed almost every US Nobel prize laureate of the 20th century. She found that these individuals often rose through the ranks through accumulated advantage. One advantage helped them secure the next position/grant/collaboration, which increased their advantage and helped them secure the next one, and so forth. Zuckerman found that the main advantage gained through this ladder-climbing was not necessarily more resources (money, equipment), but having the opportunity to culturally adopt the research style of other successful scientists. Consistent with this, most future Nobel laureates have been advised by Nobel laureates or by researchers who would later become laureates themselves. So good questions to ask are: Can my research advisor&#8217;s research style help me in my career? Do I want to be a researcher who follows the style of my advisor?</p>
<p>While your advisor will be the focal point of research culture, research culture is also created through interactions among the students in your advisor&#8217;s lab. It is usually subconsciously adopted over time. Most students might not be aware of how they were shaped by their advisor and research group. It happens automatically and does not necessarily require explicit thinking or effort.</p>
<p>To give you some examples of what facets of a research style might look like:</p>
<ul>
<li aria-level="1">Ideas are cheap and belong to the research group. Execution of those ideas as research projects is real research.</li>
<li aria-level="1">Novel ideas are everything. If someone publishes something even remotely similar to what you have, you should give up the project and work on something nobody is working on.</li>
<li aria-level="1">Good science is good math. A paper should be mathematically solid so that it will stand for years, holding valuable insights and generalizations that go beyond the current theoretical application.</li>
<li aria-level="1">Good science is robust science. A paper should have careful claims with robust evidence. This will help make the field progress more quickly by providing reliable information to build on.</li>
<li aria-level="1">Good science is a good research vision. A paper should be about what is possible in the future and where a line of research could lead to. Evidence augments vision, but a paper without vision is blind, incremental, and will be forgotten.</li>
<li aria-level="1">Good science is good insight. Some insights can be extrapolated and be applied to many other scientific problems, many of which have not been formulated yet. Finding and expressing these insights is vital for scientific progress.</li>
<li aria-level="1">It is all about productivity. Research is inherently noisy and messy, and it&#8217;s tough to predict the outcome of an idea or set of experiments in the development stage. Navigating this uncertainty is best done through fast iterations and balancing multiple projects to maximize the chances of a big success.</li>
<li aria-level="1">Good science is collaborative. Different people can bring unique perspectives to a project and increase the chance of serendipitous insights. Collaborations bring the best out of people and can result in a sum that is larger than its parts.</li>
<li aria-level="1">Good science is solitary. To gain the deepest insights into a problem, one has to understand a problem in its fullness without outside help. While collaborators can join later, the all-encompassing understanding of a problem through solitary pursuit is critical to tackling the most important scientific problems and for growing as a researcher.</li>
</ul>
<p>These are just some examples, and usually, a research style is made up of a multitude of facets like these. Research style is complex but can best be encapsulated by the question &#8220;What does good/bad research look like?&#8221; However, if you ask this question during the visit days, you often find that people answer what they think good research is supposed to look like, rather than what it looks like for them. Therefore, better questions for visit days are:</p>
<ul>
<li>&#8220;What are examples of research papers you like?&#8221;</li>
<li>&#8220;What research papers (in your area) do you think are the most important ones in the last years?&#8221;</li>
</ul>
<p>These questions often reveal what people think are important problems and the &#8220;correct&#8221; manner of approaching these problems. Both qualities are strongly related to research style.</p>
<p>Since the acquisition of research style is largely automatic and subconscious, it is crucial to understand which research style you will be adopting by joining a particular school and lab/advisor. So, what can such adoption look like?</p>
<p>Take a friend of mine as an example who, at the start of his PhD, could culturally be described as a minimalistic hacker-researcher—someone who tinkers around with minimal changes to a system to improve it in a simple manner. He teamed up with an insight-driven neat professor for his PhD. After a couple of years in his PhD, he had learned to be an insight-driven hacker. He builds hacks, understands the deep relationships of how his hack affects the system, and then extracts this insight in the most minimalistic and well-formulated way possible along with his practical hack. The combination is a pretty potent mix: the minimalistic insight-driven hacker-researcher. This person finds small hacks that yield robust results, along with insights into how the hack relates to other research and concepts.</p>
<p>One friend described me as a product-driven experimental hacker, meaning someone that rapidly prototypes changes to a system and tests them experimentally for reliable effects. If reliable effects are found, the hack is extracted into a product that other researchers can easily use. I was pretty surprised by that view at first, but I now think it pretty much hits the nail on the head.</p>
<p>Some friends I would describe as:</p>
<ul>
<li>concept-centered experimental visionary</li>
<li>gregarious cool-stuff-can-be-good-science collaborator</li>
<li>mind-the-gap collaborator</li>
<li>principled neat-and-tidy collaborator</li>
<li>I-like-cool-stuff researcher</li>
</ul>
<p>It is important to note that there is no right or wrong, good or bad research style. For example, in Zuckerman&#8217;s work, two Nobel Laureates in the same field would sometimes have radically opposite research styles, yet each style made both the Nobel Laureates and their students successful. Similarly, while an I-like-cool-stuff style sounds unimpressive, the Feynman-like playfulness of an I-like-cool-stuff researcher might lead to significant discoveries that others overlook because others do not deem these problems serious enough to consider working on them.</p>
<p>Looking at my friends, they often came in with a particular mindset, and looking at them now, they very clearly adopted the central cultural tenet of their research environment.</p>
<p>It can be very empowering to enter grad school with this view. A friend of mine entered grad school and, upon hearing this interpretation, actively thought about how he could augment his research style with a particular advisor&#8217;s style. He switched advisors until he found the right ones. Then he leaned in and tried to adopt the advisor&#8217;s central cultural research facet as quickly as possible. My friend&#8217;s primary research advisor told him four years into the PhD that there was nothing left that he could teach him and that he should graduate and move on to learn more. I was not surprised and think it is directly related to my friend&#8217;s viewpoint that adopting a research style and developing research taste is the most crucial element of a PhD.</p>
<p>So while elusive and hard to define, the research style of a particular advisor or department can be an essential consideration to choosing the grad school that is right for you.</p>
<p>While the following sections will dive into other angles on choosing potential advisors, they can also be interpreted as sub-components of research style, particularly the advisor values section.</p>
<h3><span data-preserver-spaces="true">Advisor Research Fit</span></h3>
<p><span data-preserver-spaces="true">Students often do not know what to look for in an advisor and often cling to the idea that they need to find an advisor who is interested in the same research that they want to do. There is some truth to this idea, but this idea is more dangerous than it is helpful. Of my friends in the 2nd year, about 66% changed their research direction completely — many of them in the first year. That number is higher if I look at later years. Most of them still work in their subfield (robotics/NLP/vision), but they switched to different research areas in those fields. Some examples:</span></p>
<ul>
<li><span data-preserver-spaces="true">Multilingual parsing -&gt; multilingual models -&gt; machine translation</span></li>
<li><span data-preserver-spaces="true">question answering -&gt; dialog -&gt; reinforcement learning -&gt; semantic parsing</span></li>
<li><span data-preserver-spaces="true">NLP architectures -&gt; machine translation -&gt; model efficiency</span></li>
<li><span data-preserver-spaces="true">human pose recognition -&gt; sim2real</span></li>
<li><span data-preserver-spaces="true">question answering -&gt; model efficiency -&gt; interpretability -&gt; model efficiency</span></li>
</ul>
<p><span data-preserver-spaces="true">What you see from these transitions is that an exact fit is not needed with advisors since your research interests will change. The same is true for your potential advisors: they might no longer be interested in research that they are well known for, or they might be interested in a direction which they have not yet published in. Compared to students, advisors have much more breadth and might be equally interested in many different research directions at once. Furthermore, while new professors are often compelled to stick closer to a specific research direction until they get tenure, tenured professors can be very flexible in research directions, and their interest might also be influenced significantly by the interests of their students. More senior professors are often happy to take on a completely new research direction that is interesting to you and compelling to them — this can be the advantage of hands-off advisors, which I will talk about in the next section.</span></p>
<p><span data-preserver-spaces="true">Despite the overall fluidity of research interests of both you and your advisor, it is a good idea to have at least some overlap. It might be worth asking about the advisor&#8217;s long-term research vision, but be aware that such plans are often not well fleshed out and can change quickly based on changes in the field (e.g., BERT). It might also be worth looking at the values of an advisor because they are rather stable over time, and they can hint which kind of research they like — more on values later.</span></p>
<h3><span data-preserver-spaces="true">Advising style: Hands-on vs Hands-off</span></h3>
<p><span data-preserver-spaces="true">Advising styles can be mainly separated into hands-on and hands-off styles. What does this mean?</span></p>
<p><span data-preserver-spaces="true">In general, what you can expect from a hands-off advisor is that you do all the work, and your advisor gives you feedback on what you have produced. With a hands-on style, the advisor also helps with producing that work in some way.</span></p>
<p><span data-preserver-spaces="true">More concretely, a hands-on advisor might help you with many details of your research: Brainstorming research ideas, discussing research ideas and problems in detail, helping define research problems and ideas, thinking about a narrative for your paper, formulating claims, structuring the research project into pieces with milestones, checking in frequently to discuss partial results, discussing programming problems and bugs, providing rapid feedback, steering the project to prevent failure, providing detailed feedback for the write-up, providing detailed feedback for presentation slides – all of these are signs of a hands-on advising style.</span></p>
<p><span data-preserver-spaces="true">A hands-off advisor will help you with the high-level aspects of your research: Discussing the viability and impact of a research idea, discussing research narratives/pitches/claims, discussing research results, and providing (high-level) feedback on the final paper draft and slides. Working with a very hands-off advisor has many benefits, but in terms of direct help and interaction with your hands-off advisor, you often cannot expect much more than what I list here.</span></p>
<p><span data-preserver-spaces="true">The hands-on / hands-off dichotomy is a continuum — usually, an advisor exhibits a mix of these traits. For example, some advisors might be very hands-off, but are very involved in idea generation, while yet others really like to give detailed feedback on writing. Usually, advisors also adjust slightly to the needs of each student and can be more hands-on in research areas where he or she is well-established. It is useful to talk to students to get the exact details in which areas the advisor is hands-on or hands-off. Areas here can refer to activity areas (help with writing, brainstorming ideas, thinking about a research story, etc.), technical areas (helping with bugs/code, finding the right software framework), and research areas (machine translation, question answering, etc.). So you should not ask students, &#8220;Is your advisor hands-on or hands-off?&#8221;, but instead you should ask, &#8220;Is your advisor hands-on with giving feedback on writing?&#8221; and so forth. Ask about the areas that are most important to you (your weak areas).</span></p>
<p><span data-preserver-spaces="true">A hands-on advisor is great if you are less experienced in research, need more structure and deadlines, are unsure about potential research topics, and are externally motivated. A hands-off advisor is great if you want more freedom and independence, and also, if you want to learn more through failure and adversity — being on your own for a good portion of the PhD can be difficult, but it also makes you a better independent researcher. If an advisor is not overly helpful, that is great for you in the long-term, but it can be difficult for you in the short-term, especially in the first year or when you need to navigate important milestones such as conference deadlines.</span></p>
<p><span data-preserver-spaces="true">Usually, hands-off advisors are more senior and can also provide more connections for internships and collaborations and are able to link ideas to some good-old research ideas that most people forgot about. They usually also have a more extensive lab with postdocs and a range of senior PhD students, which can provide valuable hands-on advice.</span></p>
<p><span data-preserver-spaces="true">A hands-on advisor can usually develop you in more detail and, in the ideal case, will provide a gradual increase of independence towards the end of your PhD. Through this process, you will become similar to your advisor since a hands-on advisor develops you in his or her image. This can be a good or bad thing, depending on what you want to do with your career. If your hands-on advisor&#8217;s research vision is highly sought after in industry or academia, it is an advantage; if the market is saturated with the same research vision, you are just another fish in the sea.</span></p>
<h3><span data-preserver-spaces="true">Advisor Values, Strengths, and Weaknesses</span></h3>
<p><span data-preserver-spaces="true">What does the advisor care about? This is often overlooked, but the values of your advisor can make or break a good fit. It also defines the environment within the research group. Why do values matter?</span></p>
<p><span data-preserver-spaces="true">As noted above, interests change all the time, but values are much more stable and vital for a healthy relationship. While differences in interests are often fine (machine translation vs question answering), differences in values can create conflicts (overclaiming vs underclaiming). In general, you want the same as in any relationship: Share as many values as possible and have differences in strengths and weaknesses which complement each other. So what do values in an academic relationship look like?</span></p>
<h4><span data-preserver-spaces="true">Neats vs Hackers</span></h4>
<p><span data-preserver-spaces="true">One fundamental difference in academic values is whether your advisor is a Neat or a Hacker: A Neat is someone who values systematic investigation, sound assumptions, proofs, precise claims, and theoretical progress, whereas a Hacker believes that adherence to rigid schemes slows down progress. A Neat is careful in their methodology, cautious in making claims, and lets results speak for themselves: &#8220;Another solid result for the literature.&#8221; A Hacker first and foremost values results and their impact and practicality: anything &#8220;that works&#8221; is acceptable. A Hacker values unconstrained exploration, integration of things &#8220;that work&#8221;, and progress that makes a difference in the real world. Hackers are usually less careful with making precise claims because they believe it is more important to think about the (yet unproven) potential and possibilities of an idea. Hackers like to show off their work: &#8220;Look at this cool method — the results are unbelievable!&#8221;</span></p>
<p><span data-preserver-spaces="true">None of these roles is inherently more valuable than another — both are needed to make progress in science. The best results in science often come from critical discussion and work across these camps.</span></p>
<p><span data-preserver-spaces="true">This is also a continuum. I am a Hacker at heart, but I get offended if someone misuses (or does not use) statistics or if someone makes theoretical claims built on weak theoretical foundations.</span></p>
<h4><span data-preserver-spaces="true">Discretion and In-group cohesion</span></h4>
<p><span data-preserver-spaces="true">Does the advisor value discretion and privacy while remaining open at the same time? This encourages honesty and directness between you and your advisor, but you might know less about what other students work on and how they make progress on their projects. A lack of such information might feel isolating. On the contrary, an advisor who tells you about his or her other students&#8217; projects and progress makes it easy for you to get involved with other students, which facilitates in-group collaboration and cohesion — you stick together and support each other, and it feels a bit like a family. The problem with that is that if you say something, everybody will know soon enough — so you need to be careful about what you say, which can be stressful and can lead to a culture of closedness or faking: &#8220;Everything is okay with my project — I do not need help!&#8221;</span></p>
<h4><span data-preserver-spaces="true">Well-being and Research Progress</span></h4>
<p><span data-preserver-spaces="true">Does the advisor value your well-being over research progress or vice versa? An advisor who values your well-being will make sure that there is freedom for work-life balance and that your 1-on-1 meetings are not only about research. While your mental health and stress levels are first and foremost your responsibility, an empathetic advisor will be able to see if you are overdoing it and can offer guidance to avoid overwork and burn-out. On the other hand, such an advisor makes it easier for you to slack off and have research projects slide into oblivion, which can stop progress and make you feel depressed or make you feel like a failure.</span></p>
<p><span data-preserver-spaces="true">Advisors that push you to your limit to do research might be a good fit if you need some pushing to be productive. However, too much pushing, or if you do not like to be pushed, might cause burn-out, high stress, or might make you anxious to meet the high expectations of your advisor.</span></p>
<h4><span data-preserver-spaces="true">Communication</span></h4>
<p><span data-preserver-spaces="true">Does the advisor value sharp, direct criticism or indirect, gentle hinting that something is off? If your advisor values head-on criticism, he or she will call out bullshit and tell you how much your project idea sucks. This is difficult to take as a student that labored hard on a research idea. On the other hand, you do not need to waste more time on this idea and can move on. If you are able to remain calm and digest such feedback, then you might be able to quickly adjust an idea and make it work. With such an advisor, you know a research idea is air-tight if he or she gives you good feedback, and it makes you proud that this idea &#8220;passed your advisor.&#8221; From there, it is easy to move on and work on the idea.</span></p>
<p><span data-preserver-spaces="true">An advisor with an indirect communication style will hint that something is off, but you might not know what or why. That can make progress slow, or it can create considerable uncertainty about whether your project is any good even months into the project. However, your feelings are not hurt by such a communication style. Furthermore, this indirectness might also demonstrate the intellectual humility of your advisor: if an experienced advisor believes he or she can be wrong, it might open up the possibility for a candid dialog of exploring what is true and what is not. This is an admirable quality that many intellectuals value highly, and it might rub off onto you. In the long-term, indirect communication has the advantage that you need to think about problems more by yourself, which makes you more independent and a better researcher.</span></p>
<h4><span data-preserver-spaces="true">Strengths and Weaknesses</span></h4>
<p><span data-preserver-spaces="true">As noted before, beyond values, it is also essential to think about how you and your advisor&#8217;s strengths and weaknesses complement each other. This is generally important for collaborations. For example, you might be great at executing research projects quickly so that you get the evidence to decide along which path you push the project, but you might be bad at generating good research ideas. An advisor that matches your core values and complements your weaknesses — idea generation in this case — will make a great tag team partner and will make it easy to wrestle those challenging research projects into submission. On the other hand, sharing weaknesses can make you and your advisor blind to problems in your research. Good advisors will recognize your weaknesses and strengths and will try to complement your style of research.</span></p>
<h4><span data-preserver-spaces="true">Self-reflection Is Key to a Good Decision</span></h4>
<p><span data-preserver-spaces="true">To understand how your values, strengths, and weaknesses align with those of your potential advisor, it might be well worth finding some time for a session or two of careful self-reflection to understand who you are and how you align with potential advisors and schools. Beyond alignment, it might also help you to identify schools and advisors which possibly facilitate growth toward specific values and strengths that you cherish but do not yet possess.</span></p>
<p><span data-preserver-spaces="true">Some questions that could get you started: Can you deal with direct, sharp criticism? How much do you value your privacy? How much honest and open conversation do you need? Are you more like a Hacker or more like a Neat? Are you a &#8220;family person&#8221; who favors very close cohesion within the research group? How self-motivated are you? Do you need deadlines and milestones to keep you motivated and on track? Do you work well if someone pushes you? How much work-life balance do you need?</span></p>
<h3><span data-preserver-spaces="true">Advisor Availability and Absent-mindedness</span></h3>
<h4><span data-preserver-spaces="true">Availability</span></h4>
<p><span data-preserver-spaces="true">Availability does make a difference. It is better to have more frequent meetings, even if these meetings are with a more hands-off advisor and you have no results to discuss. If an advisor also works at a startup/company or has many students, it might be that meetings are infrequent, canceled, or postponed, and that additional meetings in times of need are not possible. However, availability is not only about your advisor&#8217;s busy schedule but also their attitude. For some advisors, student meetings are &#8220;holy&#8221; and are rarely canceled or rescheduled. Some advisors are also open to frequent meetings in times of need while others are not.</span></p>
<h4><span data-preserver-spaces="true">Absent-mindedness</span></h4>
<p><span data-preserver-spaces="true">Another closely related factor is absent-mindedness. There are advisors who forget about projects, so you have to explain to them over and over again what you are working on. Even if such an advisor is available, a certain degree of absent-mindedness can make interactions frustrating and unproductive. On the other hand, similarly to a hands-off advisor, this forces you to think carefully about your project and formulate exact problems before a meeting, which will make you a better researcher in the long term. Being able to formulate your project as a concise elevator pitch with a precise definition is a highly valuable skill that will impress anyone you meet at a conference. The other extreme is advisors who reserve blocks of time to think about your project on their own, outside of meetings — which has obvious benefits: better feedback, guidance, and new angles on the project, which might improve it significantly. On the other hand, this can make you dependent on your advisor&#8217;s thoughts, which can prevent you from becoming an excellent independent researcher.</span></p>
<h2><span data-preserver-spaces="true">Peers, Postdocs, and Research Group</span></h2>
<h4><span data-preserver-spaces="true">Peers</span></h4>
<p><span data-preserver-spaces="true">The peers and the research group are the second most important factor in choosing a school, and this factor is not far behind the advisor in importance. Regarding research interests, the situation is a bit similar to advisors: your peers&#8217; research interests change over time but will usually stay in a related area. As such, it might be possible to have long and fruitful collaborations with particular students, but it is probably more realistic to see people in your research group as peers with whom to discuss research ideas and from whom to get feedback.</span></p>
<p><span data-preserver-spaces="true">But there are other things which are robust over time, such as general interests and values. During your visit days, you get to know some of your possible peers — both other admits and people in research groups — and sometimes it is evident when you &#8220;click&#8221;. Although it is difficult to get to know people in detail in this short time, a group of people that you click with might be a good reason to go to that school. A friend who supports you through the difficult times and who challenges you to grow will be very helpful in your journey through a PhD and beyond.</span></p>
<h4><span data-preserver-spaces="true">Research Group</span></h4>
<p><span data-preserver-spaces="true">Beyond individual peers, you should also consider the research group of your potential advisor in your choice of grad school. The dynamics of a research group are quite revealing about its norms and values. The values of the advisor (see above) strongly shape the dynamics of the research group. You can use a framework similar to the one presented above to assess the values and expectations of your peers within a research group.</span></p>
<p><span data-preserver-spaces="true">Another critical lens on research groups is their power dynamics and diversity, which are strong predictors of the success of the overall group. Research suggests that groups work best if a powerful individual brings together people with very diverse backgrounds, views, and experiences and, once they are among themselves, gives up his or her power and lets these people collaborate on an even playing field.</span></p>
<p><span data-preserver-spaces="true">Diversity is particularly important for creative endeavors because it helps to prevent echo chambers. Let&#8217;s say you have a group of hackers that reads about a new research method A:</span></p>
<p><span data-preserver-spaces="true">Hacker 1: &#8220;Wow method A is so exciting. The results on Task C are so great! It would be so cool to mash it together with method B and try it on task D!&#8221;</span><br />
<span data-preserver-spaces="true">Hacker 2: &#8220;You are right, that would be so interesting!&#8221;</span><br />
<span data-preserver-spaces="true">Hacker 1: &#8220;Let&#8217;s do it! Let&#8217;s hack it together!&#8221;</span></p>
<p><span data-preserver-spaces="true">The dynamics are very different if you add some neats into the mix:</span></p>
<p><span data-preserver-spaces="true">Neat 1: &#8220;From Author, et al. 2020, I know that the standard deviation on task C is quite high and I think confidence intervals from method A would overlap with method X — the results from method A do not seem any better than results from the simpler method X. So, by Occam&#8217;s razor, I do not think there is any reason to extend method A.&#8221;</span><br />
<span data-preserver-spaces="true">Neat 2: &#8220;I think their performance is mostly explained by their unusual initialization rather than method A. With that initialization you expect lower relative differences between the eigenvalues of the Hessian and thus faster training — so I think the number of epochs is a confounding factor and their comparison is invalid. I do not believe method A is actually better. They should have used the same initialization or at least done a grid search over learning rates and epochs for a proper comparison.&#8221;</span></p>
<p><span data-preserver-spaces="true">This is an example of a neat vs hacker debate, but the same goes for many other traits and values. For example, if you only have people who discuss ideas with direct, blunt criticism, the interactions can feel pretty overwhelming and intense, and good ideas might be lost within the group because it is too tiresome to talk about them. Instead, a mix of playful and serious people might be able to balance free idea generation with rigor and carefulness.</span></p>
<p><span data-preserver-spaces="true">Other extreme dichotomies might include theory vs applications thinking: &#8220;Life is temporary, only proofs are eternal.&#8221; vs &#8220;If you make the &#8216;greatest&#8217; invention ever and it does not affect a single life, then what is the point of that?&#8221; Or quantitative vs qualitative thinking: &#8220;If you cannot measure it, it does not exist!&#8221; vs &#8220;Try to measure how much you love your spouse and then tell me which number it is — it does not work!&#8221; There are probably many more of these extremes.</span></p>
<p><span data-preserver-spaces="true">Of course, virtually nobody believes these statements literally, but some people identify more with one than the other, and having a healthy mix of each of these perspectives within a research group prevents groupthink, bias, and unreasonable extremes.</span></p>
<h4><span data-preserver-spaces="true">Postdocs and Senior PhD Students</span></h4>
<p><span data-preserver-spaces="true">As briefly mentioned above, postdocs and senior PhD students can also have a tremendous impact on the advising situation and should be considered carefully in your choice. If your advisor has postdocs and senior PhD students who frequently collaborate with new PhD students, it can be a big win for both parties: you get additional hands-on experience and a research perspective that is different from your advisor&#8217;s (especially with postdocs), and they might be able to get another publication before they move on to the next job. Having senior PhD students and postdocs is particularly valuable if your potential advisor is hands-off — in this case, you can get the best of both worlds in terms of advising.</span></p>
<h4><span data-preserver-spaces="true">Others</span></h4>
<p><span data-preserver-spaces="true">Other important factors for a good research group are how openly ideas are shared and discussed (what happens in a regular research group meeting) and how much students collaborate (easy to check by looking at their publications). The degree of collaboration is also a good proxy for group cohesion. I will talk a bit more about the importance of socializing in research groups in the &#8220;Stability Perspective&#8221; section below, so I will not repeat myself here.</span></p>
<h2><span data-preserver-spaces="true">School Name and Resources</span></h2>
<h3><span data-preserver-spaces="true">Accumulated Advantage</span></h3>
<p><span data-preserver-spaces="true">To make rational choices about the prestige of a school, it is essential to understand why it actually matters.</span></p>
<p><span data-preserver-spaces="true">The scientific reason why school names matter is that they represent a proxy of accumulated advantage, which is a good predictor of current performance. </span><a class="_e75a791d-denali-editor-page-rtfLink" href="https://www.jstor.org/stable/2095162" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Cumulative advantage</span></a><span data-preserver-spaces="true"> is the idea that the more privilege you had in life, the more likely you had the resources (money, educated parents, mentoring, good peers, free time, extracurricular activities, extensive social network) to do well (rapid development, good grades) and this gives you more resources (better schools, better jobs, better connections) to do even better (promotions, tenure, grants) which yields even more resources (even more extensive social network, collaborations, grants, funding) to do even better (Nobel prize, Fields Medal, unicorn startups).</span></p>
<p><span data-preserver-spaces="true">The distribution of advantage at any of these stages is highly unequal, with the top few percent being the most productive and gaining the most resources: ⅓ of the US population gets a Bachelor&#8217;s degree, 2% a PhD, 0.2% a top 20 undergrad degree, 0.06% a tenure track position, and </span><a class="_e75a791d-denali-editor-page-rtfLink" href="https://www.sciencemag.org/news/2014/07/1-scientific-publishing" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">0.0006% of people publish 41% of papers in research journals</span></a><span data-preserver-spaces="true">. But at the same time, at a top school, </span><a class="_e75a791d-denali-editor-page-rtfLink" href="https://timdettmers.com/2018/11/26/phd-applications/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">73% of PhD positions are given to people with undergrads from the top 20 schools</span></a><span data-preserver-spaces="true">, and the top 18 schools produce </span><a class="_e75a791d-denali-editor-page-rtfLink" href="https://advances.sciencemag.org/content/1/1/e1400005" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">50% of professors</span></a><span data-preserver-spaces="true">. We can do a back-of-the-envelope calculation with these data, making some simplifying assumptions, to estimate the probability of becoming a professor if you do a bachelor&#8217;s or a PhD at a top 18 school. If we assume that the 50% of professors from the top 18 schools are equally distributed, then 1/36 of all professors come from each top 18 school.</span></p>
<p>Thus, if you do a PhD at a top 20 school, your prior probability of becoming a professor jumps from 0.06% to about 2.8% — about 50 times more probable, but still only as likely as rolling two sixes with a pair of dice. This means you can increase your chances dramatically by choosing a prestigious school, but the odds are still heavily stacked against you. Similar statistics hold true for other choices based on prestige or school ranking. Making a choice based on school ranking alone will probably not lead to success. Other factors, like a great advisor, great peers, a productive research group, school culture, and social opportunities, are probably more critical for success.</p>
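<p>As a sanity check, the back-of-the-envelope arithmetic above can be sketched in a few lines of Python. This only re-derives the post&#8217;s own rough figures (0.06% baseline, 50% of professors from 18 schools); none of the inputs are new data.</p>

```python
# Back-of-the-envelope check of the numbers in the text above.
# All inputs are the post's rough estimates, not new data.
baseline = 0.0006              # ~0.06% of the US population reach a tenure-track position

top18_share = 0.50             # top 18 schools produce ~50% of professors
per_school = top18_share / 18  # assume an equal split: 1/36 of all professors per school

# Reading that per-school share as the chance of becoming a professor
# after a PhD at such a school:
boost = per_school / baseline  # how much the odds improve over the baseline

two_sixes = (1 / 6) ** 2       # rolling two sixes with a pair of dice = 1/36

print(f"per-school share: {per_school:.1%}")        # ~2.8%
print(f"improvement over baseline: ~{boost:.0f}x")  # ~46x, i.e. "about 50 times"
print(f"two sixes: {two_sixes:.1%}")                # also ~2.8%
```

<p>The &#8220;pair of dice&#8221; comparison works because 0.5/18 happens to equal exactly 1/36, the probability of rolling two sixes.</p>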
<h4><span data-preserver-spaces="true">Some Failure and Adversity is Critical for Success</span></h4>
<p><span data-preserver-spaces="true">A different perspective that might seem unintuitive at first is that a long streak of privilege can have harmful long-term consequences for you. Failure and adversity are great tools for personal growth and </span><a class="_e75a791d-denali-editor-page-rtfLink" href="https://pubmed.ncbi.nlm.nih.gov/20939649/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">growth as a researcher</span></a><span data-preserver-spaces="true">. This is a </span><a class="_e75a791d-denali-editor-page-rtfLink" href="https://psycnet.apa.org/buy/2010-21218-001" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">well-established finding in psychology</span></a><span data-preserver-spaces="true">: to succeed in life, you need to fail sometimes, but not too often. The intuition is that the extremes of privilege or adversity lead to poor mental models of perfectionism and </span><a class="_e75a791d-denali-editor-page-rtfLink" href="https://en.wikipedia.org/wiki/Learned_helplessness" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">learned helplessness</span></a><span data-preserver-spaces="true">, respectively, while occasional failures lead to a mindset of </span><a class="_e75a791d-denali-editor-page-rtfLink" href="https://en.wikipedia.org/wiki/Learned_industriousness" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">learned industriousness</span></a><span data-preserver-spaces="true">. This means that too much privilege will make you afraid to take risks and to fail because you have never failed before. Occasional failure will make you resilient because you know adversity is normal and temporary — a mindset that enables the pursuit of creative but risky ideas.</span></p>
<p><span data-preserver-spaces="true">For example, if you are at a top 20 school, it might be expected of you that you behave like a top 20 school researcher: Publish many world-class papers in a short period of time. Such a competitive environment might encourage &#8220;safe&#8221; research that is easily publishable over creative research that is prone to fail. Such a school, while providing a boost in privilege and resources, might prevent you from becoming a successful and creative researcher in the long-term. Challenging yourself in a non-perfectionist way is important — make sure there is enough opportunity for lessons learned through failure at the school that you choose.</span></p>
<p><span data-preserver-spaces="true">Similar dynamics appear in industry: failing with a startup is often seen as a prerequisite for founding a successful one. Joining a scrappy startup might make you a skilled engineer, while joining a big tech company might lead to stagnation in skill — you are just another cog in the machine.</span></p>
<h3>Computational Resources</h3>
<p>After the release of BERT, some of my peers felt energized by the exciting results of large pretrained models, but equally many of them seemed defeated. It must be excruciating to see your research sub-field and the research that you worked so hard on being crushed by the simple idea of throwing more GPUs and data at the problem. But this is the reality that we live in. With the advent of GPT-4, we have reached another such critical point, and you should take care that you are in a position to do meaningful research after GPT-4. What does this mean?</p>
<p>In most sub-fields, there are general ideas that are unaffected by scale or even incorporate scaling into their outlook. As such, hope is not lost even if your research will be affected by GPT-4. However, this may mean that the research you are doing will be very different before and after GPT-4—just like it was with BERT. While it is very unpredictable how research will shift with models like GPT-4, it can be worthwhile to think of possible ways your research could adapt in specific scenarios.</p>
<p>While it might run completely counter to your current research and values, it can be helpful to think about hypothetical worst-case scaling scenarios. Consider this example, which may border on outrageousness depending on your values. While most bias and fairness research at *CL conferences frames large pretrained models as a problem, there are already hints that scaling models might solve such issues if used correctly. As such, a worst-case hypothetical to ask yourself could be: &#8220;If scaling laws for bias and fairness show that scaling eventually resolves bias and fairness, what would you do?&#8221; Answers to this particular case might include:</p>
<ol>
<li aria-level="1">Shift your research to the broader perspective of human preferences, which already has a foothold into scaling.</li>
<li aria-level="1">Incorporate scaling laws into your research and analyze the properties of models/data/methods that lead to improved bias scaling, particularly at a smaller scale or with fewer resources.</li>
<li aria-level="1">Try to scale along dimensions other than compute and data, for example, by using reinforcement learning on bias and fairness user feedback data.</li>
<li aria-level="1">Think about alternative research sub-fields you might want to switch to if this happens.</li>
<li aria-level="1">Relate bias and fairness to some effect known to be a factor at scale. For example, how does memorization relate to bias and fairness?</li>
</ol>
<p>The next question to ask yourself is this: how many computational resources will I need to proceed in this manner? And with that, it follows: will I have these computational resources if I join a particular school over another?</p>
<h2><span data-preserver-spaces="true">School location (campus &amp; city)</span></h2>
<p><span data-preserver-spaces="true">I will not elaborate here, since I address the important considerations for these factors mostly in the Stability and Variability Perspective sections. To foreshadow a bit how to think about them: the campus and city, with their possibilities, will offer opportunities that help you do the things that you know will ground you and keep you stable so that you can sustain the difficult journey that a PhD is (stability). Each city and campus also offers a different range of new activities and experiences (variability), which help you explore who you are and what you like and make you a fuller, more vibrant human being.</span></p>
<p><span data-preserver-spaces="true">The right way to think about this factor is very personal; it can either be insignificant or more important than all the factors listed before — it is worth stopping to think and reflect on it quite a bit. But more on that in the Stability and Variability Perspective sections.</span></p>
<h2><span data-preserver-spaces="true">Other Factors</span></h2>
<p><span data-preserver-spaces="true">There are other factors that I could write about, but they are not that important. Housing, living costs, and stipend/salary matter less than you might think. There are differences, and one school might pay more than another or have lower living costs, but the outcome will be the same: you will not be rich, you will not be poor, and whatever your living arrangements are, they will feel like home eventually. At some universities, you can work part-time in an industry research lab and make much more money, but that also adds extra complications.</span></p>
<p><span data-preserver-spaces="true">It also makes sense to consider the university culture and the research group culture, but these are quite closely tied to the values that your potential advisor, peers, and research group hold. Culture is also closely related to the Identity Perspective, which I will introduce next.</span></p>
<h1><span data-preserver-spaces="true">The Identity Perspective: Who do you want to be?</span></h1>
<p><span data-preserver-spaces="true">If you choose a particular school, you will be actively shaped by the environment you live in and the people you interact with on a day-to-day basis. The identity perspective is then the perspective where you try to optimize for the person that you might become. While the career perspective looks at the question “How much success am I expected to have?” the identity perspective looks at the person that you might become and asks, “Do I want to be that person?”</span></p>
<p><span data-preserver-spaces="true">Choosing a school based on who you think you will be is a very personal and subjective choice. I do not believe some specific examples will help you understand how to think about this choice. Instead, I want to give you my personal experience and how that experience shaped my belief about the person that I might become for each school. I gathered most of this experience during visit days where people tend to show their best selves, but I also experienced interactions at conferences and internships, which might have been a bit more authentic. I believe the aggregate identities can give you an accurate picture of the person that you might become.</span></p>
<h4><span data-preserver-spaces="true">My Experience at Visit Days</span></h4>
<p><span data-preserver-spaces="true">While I have mixed experiences with students from most schools, there is one school from which I never met someone who was nice or treated me with decency. They would often ask where I study, and if the answer was not the right university, they would move on to people who actually study or studied at these “right” universities. Sometimes, they would look at my badge, and if it showed my university (Università della Svizzera italiana), they would proceed to ignore me in a conversation with other people. During the visit days of one school, someone looked at my badge that displayed my undergrad university, The Open University, and said: “Wow! It is nice that they give people like you a chance!” He left before I could respond. I am sure there are friendly people at these places, but if I meet 15 people from a university and they are all very superficial and disrespectful, that is quite telling. So do I want to be a vain, shallow, and rude person? No, thanks.</span></p>
<p><span data-preserver-spaces="true">At another school, I had the most alienating and isolating experience of my visit days. People from elite universities formed cliques and did not let other people in. The extra accommodation I had to book because of flight scheduling was not paid for. I felt as if I was being made fun of for my food preferences. My meeting with a potential advisor was botched, and I needed to share a time slot with another student. This happened twice. One potential advisor was not there and did not try to contact me before or after the visit days. Another potential advisor belittled me in the meeting I had with him. Many people that I met at that school seemed to feel the need to put on a happy face when they were actually very sad or stressed. Do I want to be a person who supports an environment where anything goes, where deception is the norm? Do I want to be a person who feels the need to show a “happy face” even when miserable? No, thanks.</span></p>
<h4><span data-preserver-spaces="true">My Experience at the University of Washington</span></h4>
<p><span data-preserver-spaces="true">In stark contrast, the environment at the University of Washington (UW) was designed so that nothing could go wrong. I felt that everyone had a good time at the UW visit days. That at least shows that people are conscious and aware of social dynamics. What was most striking to me was the student panel: a visitor brought up the question of mental health and stress during the PhD. The panel went all out, talked about their mental health issues, how they coped with them, and what mental health resources are at your disposal as a UW student. Similarly, many students were honest and open and emphasized that the students are a team and look after each other. They also made clear that time outside of work is very important to them. One other very endearing thing to me was how friendly people in Seattle are in general. When I arrived in New York, I could immediately feel the tension and impatience. The opposite is true for Seattle. There is a certain gentleness about things. Drivers are more patient and responsible. In Seattle, people exit through the back door of articulated buses and yell to the bus driver: “Thank you!” And in their faces, you can see that they do not just say that for show or to conform to social norms — they mean it. So, do I want to be an honest and open student who admits to his struggles, supports his peers, is collaborative, has a life outside work, and kindly thanks the bus driver? Yes, please.</span></p>
<p><span data-preserver-spaces="true">You might think it is silly to decide on a grad school based on whether people in that city thank the bus driver. But really, it is not! Since I started at UW, I have embraced being a bus-driver-thanking person. Maybe this made me more kind. Maybe it made me more appreciative of the hard work that all the people around me do. Maybe it made me write this blog post, with which I hope to help you in your difficult choice. Maybe, if I had chosen another school, this blog post would be about a cool startup idea; or perhaps I would just have spent that extra time on more research. Identities matter. With the choice of studying at UW, I, in part, chose who I want to be.</span></p>
<p><span data-preserver-spaces="true">If I take off my UW hat, I can see that one could also have a very different interpretation. Maybe the mean people who disrespected me at the elite school were just protecting their limited resources by giving time to people who probably mattered more. Perhaps the lovely people from UW are socially naive — trying to make everybody happy, which is clearly not possible. Maybe it is wiser to concentrate resources where they matter.</span></p>
<p><span data-preserver-spaces="true">It can also be different for different people. I have a good friend who is very blunt and direct — totally normal in the country he comes from, where this bluntness is a signal of trust and honesty. However, calling out bullshit in a blunt and direct way does not fit in nicely with the UW identity and culture, and that can lead to misunderstandings and problems.</span></p>
<h1><span data-preserver-spaces="true">The Stability Perspective: Schools do not matter, but what does?</span></h1>
<p><span data-preserver-spaces="true">Grad school is incredibly tough: By definition, the final goal of the PhD is to gain the skill to independently explore and confront the unknown to produce new knowledge. This requires a lot of self-motivation, enduring failure and rejection, and lots of hard work. I often heard how difficult a PhD is, but I did not believe it. Now I understand what that means. And it has not only been tough for me, but for most of my peers. As such, you want to have something to cling on to that stabilizes you and helps you to cope and enjoy the experience.</span></p>
<p><span data-preserver-spaces="true">The stability perspective acknowledges that many factors, such as happiness, gradually revert to a personal set-point while other elements are stable over time and provide you with energy, resilience, and courage in the long-term. As such, the stability perspective is about prioritizing factors that you know will help you to have a successful grad school experience in the long-term. Research shows that relationships are the most important and stable source of well-being. So it is crucial to look at the social environment when you choose a grad school.</span></p>
<p><span data-preserver-spaces="true">Usually, within a grad school, the social environments are the office, research group meetings, other group meetings, lunchtime, and social activities organized by the department or by grad organizations. Fewer research groups have social outings as a group, but that definitely helps to make grad school more enjoyable and manageable.</span></p>
<p><span data-preserver-spaces="true">One reason why I really wanted to go to University College London (UCL) was that I already knew the people there, and they were super friendly and helpful. </span><a class="_e75a791d-denali-editor-page-rtfLink" href="http://www.riedelcastro.org/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Sebastian Riedel</span></a><span data-preserver-spaces="true"> is an absolutely great advisor and a very wise person, and it was a joy working with him. But another important reason why I wanted to join UCL was actually the daily lunches.</span></p>
<p><span data-preserver-spaces="true">Someone would announce “lunch” in the office. Some would go downstairs to buy some food. Some people brought food from home. Then we would all sit around the table. We would chat about our everyday life and everyday problems. Sometimes our passions, or one curiosity or another. Some politics and news. Sometimes some research ideas. It felt like a family where people cared for each other. It was great! It gave me the energy and focus to do great work even after lunchtime, when I am usually tired and less focused. If I knew I could get this experience at a school or research group, it would definitely be part of the reason why I chose that school.</span></p>
<p><span data-preserver-spaces="true">A great source of stability for me at the University of Washington (UW) has been my office. As desks freed up, we moved more and more NLP people into our office. Now we have an NLP office where we chat about research, support each other through deadlines, and, in general, take care of each other, and it feels great. If I had known that I could have all these great, friendly peers around me, this would have been another reason for UW.</span></p>
<p><span data-preserver-spaces="true">The office environment, group meetings, and social lunches are things that will keep you stable and mentally tough throughout the PhD and can be a valid and important reason to choose one school over another.</span></p>
<p><span data-preserver-spaces="true">Beyond the social environment, there can also be fundamental personal reasons to prefer one school over another. These are usually not discussed much because they are too distinct — I will give you some examples anyway, which might be a guide on how to think about your personal reasons.</span></p>
<p><span data-preserver-spaces="true">For me personally, Stanford was one of the top choices. Stanford is impressive academically, and I had a great fit with potential advisors there. However, one other thing that stood out for me was the bicycle track around Stanford. I am an avid inline-skater — or rollerblader, as it is called in the US — and inline-skating gives me a lot of stability. It is vital for my mental health. The joy and freedom I feel while skating helps me to get through the dark periods in my life. The bicycle track around Stanford is an absolute dream if you are a skater: Very smooth, flat, dry weather. I imagined myself getting up at 5:00 am and skating every morning through a deserted campus — what a pleasant thought!</span></p>
<p><span data-preserver-spaces="true">Another popular topic is relationships, family, and friends. Most people do a PhD because it is their passion. If you can combine your passion with a great partner, it could give you all you need to flourish as a person. If you can go to a school together with your partner, this can be a good reason to choose one school over another. If you make this choice, however, you should also be aware that doing a PhD is a significant stressor for a relationship, and you should think about whether you would like to stay at a particular school if your relationship were to end. Of the friends who started the PhD with me, most, including myself, had their relationship end partly due to the PhD. A PhD is not easy for a relationship: moving across a country or continent, adjusting to a new culture, working long hours, little pay, night shifts, and high stress before deadlines. The pressure and stress from a PhD can make you depressed, anxious, absent-minded, and unresponsive — not the ideal state to be in for a relationship. You get used to this and learn how to handle these stressors, but particularly during your first year as a PhD student, it can be an enormous strain on your relationship. On the other hand, being able to bring your partner along or to reunite with your family will give you great strength and motivation. You will be able to push harder and further with your research. You will be able to cope better and recuperate faster. A PhD is challenging, and having the most important person close behind you makes it much more manageable.</span></p>
<h1><span data-preserver-spaces="true">The Variability Perspective: The possibility of a better you</span></h1>
<p><span data-preserver-spaces="true">The stability perspective was about choosing a school based on factors that you know will stabilize you so that you can do your best work. The variability perspective is about choosing a school based on possibilities that will enable you to become your full self — a flourishing human being. Possibilities mean you do not know for sure that these factors are important to you, but you have a hunch, a feeling, a common thread through your life that makes it seem like you just need to try certain things. Schools that enable you to explore certain unexplored parts of yourself and your interests have the potential to make you a more mature, fuller human being with the right breadth and depth of experiences. Schools with low or the wrong kind of variability bear the risk that at some point in life, you will stop and feel that you lived a life that was not your own.</span></p>
<p><span data-preserver-spaces="true">But it is not only about experiences per se but also about memories. Even the greatest moments pass, and your happiness will regress back to its mean — but memories will stick with you. The memories that you create will be your own for your entire life. However, if you look back at your most precious memories, they are probably not of the time you hit the library and studied really, really hard for that test. More likely, they are unique, emotionally meaningful moments involving people you care about. How likely are you to have these exceptional moments at a school where it is common culture to work really hard on weekends to get in that extra paper for the next conference deadline? How likely are you to have them at a school that is deserted on the weekend, where your advisor tells you &#8220;It is time to submit the final paper draft&#8221; at 10:00 pm on deadline night, even though the deadline is 4:00 am the next morning?</span></p>
<p><span data-preserver-spaces="true">Academic excellence is great and important, but it is not all that matters in life.</span></p>
<p><span data-preserver-spaces="true">You might have experienced that first-hand in this crazy competitive environment where it is all about coming out on top to make it to the next stage: be it PhD admissions, an excellent postdoc position, a superb research scientist job in industry, an assistant professor job, tenure, being recognized as a &#8220;great&#8221; professor, and so forth. If you want to turn the hamster&#8217;s wheel, you can turn it all day long just fine. But as you turn the wheel over and over, you might realize, to your dismay, that you never had the experiences that other people call common life experience.</span></p>
<p><span data-preserver-spaces="true">Maybe you wanted to learn to play the guitar or try that cool sport at the gym, but then you realize the research deadline is in 3 months, and it will be tight — so you better put in those couple of extra hours! Maybe you had the feeling that you might really enjoy doing improv theater with the local group, but somehow you could not squeeze it in between classes and research. Maybe you always wanted to write some blog posts about that one topic that you are passionate about, but how can you justify spending your weekend on a blog post when on Monday you have a meeting with your advisor, and you do not have new results yet? Maybe you wanted to improve your social skills and ask your coworkers to go out and have fun, but then you realize they have no time because they are stressing out about the next research deadline: &#8220;Let&#8217;s do it after the deadline!&#8221; If you find yourself in a trap like this, it might be time to make choices that offer you a different range of experiences and opportunities.</span></p>
<p><span data-preserver-spaces="true">The critical bit is that a school should not only have opportunities that interest you; the culture should also be one that encourages the exploration of those opportunities. If you live in the best city in the world and have the best people around you, it still does not really work if your advisor and coworkers expect you to work weekends and long hours and give you a hard time when you do not. Both opportunity and a culture that supports exploration are needed for a choice to offer a good variability of experiences. The variability in experiences and memories that you will get is more like the minimum of those two factors — so also try to figure out how much freedom you will have in your research group.</span></p>
<p><span data-preserver-spaces="true">What might a concrete case look like where variability makes sense? Let me tell you a bit about my situation when I was about to start my master&#8217;s degree. During my Bachelor studies, I discovered machine learning and, a bit later, deep learning. I was hooked and realized that this was something I wanted to do for the rest of my life. However, I also knew that if I wanted to get into a good PhD program, I would need research experience. The problem was that at the online university where I was studying, I could not do research, and since I did not have any credentials at that time, nobody I contacted wanted to work with me on research. So I decided to quit my job, study full time, and do my own research during my online Bachelor studies. The work was relentless and fraught with dead ends and failure, but I did not want to give up. I was highly motivated to succeed, so I decided to isolate myself and work tirelessly with an intense focus on a research problem that involved parallelization across multiple GPUs. It was a surreal time in which months sometimes flew by without any human contact. I eventually succeeded, wrote up a paper, and </span><a class="_e75a791d-denali-editor-page-rtfLink" href="https://arxiv.org/abs/1511.04561" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">published it at ICLR2016</span></a><span data-preserver-spaces="true">. This was a big success, but it took its toll. While my peers gained life experience and social experience, and got to know who they are and what they like and enjoy, I just learned how to do research and degenerated into a weird, confused, isolated hermit. On top of that, all my PhD applications were rejected, and my only options were master&#8217;s programs.</span></p>
<p><span data-preserver-spaces="true">I did not want to do a master&#8217;s since I knew I would not learn much. I knew math, I knew computer science, I knew machine learning — it seemed a master&#8217;s degree would be just a piece of paper, and I would not benefit much from the experience.</span></p>
<p><span data-preserver-spaces="true">Enter the variability perspective.</span></p>
<p><span data-preserver-spaces="true">Looked at from the variability perspective, doing a master&#8217;s is an excellent opportunity to figure out the parts of life that are unclear and to catch up on social and life experience. Since I was already pretty good at the things I needed to study for the degree, I could slack off in class and focus on things outside of class. That is precisely what I did, and the experience was absolutely marvelous, and it made me into the person that I am today.</span></p>
<p><span data-preserver-spaces="true">I chose the University of Lugano for my master&#8217;s degree. It had small, intimate classes perfect for getting to know everyone. The master&#8217;s degree was highly international, and usually, we did not have two people of the same nationality in a class. I overcame my social anxiety, just hung out with people, and developed my social skills. I had my first romantic experiences, which are still very special to me. I also learned that I do not like to hang out with people in bars who would tell me how drunk they were the last time and what they did in their drunken state, or how nice their vacations were. But then I organized a weekly philosophical evening with two friends where we would talk about philosophy, neuroscience, psychology, deep learning research, rationality, game theory, altered states of consciousness, and how all these things relate. It was always lots of fun and very meaningful to us — from there, I knew where I belonged.</span></p>
<p><span data-preserver-spaces="true">In my spare time, I experimented with blog posts. For example, thinking </span><a class="_e75a791d-denali-editor-page-rtfLink" href="https://timdettmers.com/2015/07/27/brain-vs-deep-learning-singularity/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">about the future of computing and how it is related to the brain and deep learning</span></a><span data-preserver-spaces="true">. I experimented with writing </span><a class="_e75a791d-denali-editor-page-rtfLink" href="https://developer.nvidia.com/blog/author/tdettmers/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">guest blog posts for NVIDIA</span></a><span data-preserver-spaces="true">. I experimented with spontaneous blog posts about what is on my mind. I finished such a blog post in one morning, and I would consider it to be one of </span><a class="_e75a791d-denali-editor-page-rtfLink" href="https://timdettmers.com/2017/09/16/credit-assignment-deep-learning/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">my best blog posts</span></a><span data-preserver-spaces="true">, despite the little effort I invested in it.</span></p>
<div>
<dl id="attachment_811">
<dt><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/03/IMG_20180420_195552216_HDR.jpg?ssl=1"><img src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/03/IMG_20180420_195552216_HDR.jpg?resize=1024%2C768&#038;ssl=1" alt="" width="1024" height="768" data-recalc-dims="1" /></a></dt>
<dd>Inline skating through the public park and along the lake in Lugano was a unique experience.</dd>
</dl>
</div>
<p>I got back into inline skating. Skating along the park and lake in Lugano was a unique experience. I will forever remember the early mornings in a deserted town, mist on the mountains, skating at high speed along the still water and past beautiful flowers.</p>
<div>
<dl id="attachment_807">
<dt><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/03/IMG_20160602_203843861-1.jpg?ssl=1"><img src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/03/IMG_20160602_203843861-1.jpg?resize=1024%2C576&#038;ssl=1" alt="" width="1024" height="576" data-recalc-dims="1" /></a></dt>
<dd>View from my friend’s place. I remember sitting at his balcony having BBQ and talking about how to view important problems in life from a perspective of psychology and computer science.</dd>
</dl>
</div>
<p>I also used my extra time to gather more research experience with an internship at Microsoft Research in the US and a research internship at UCL in London. Beyond being very valuable for my self-development, these experiences were instrumental to the success of my PhD applications.</p>
<div>
<dl id="attachment_808">
<dt><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/03/IMG_20180704_204856273_HDR-1.jpg?ssl=1"><img src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/03/IMG_20180704_204856273_HDR-1.jpg?resize=1024%2C768&#038;ssl=1" alt="" width="1024" height="768" data-recalc-dims="1" /></a></dt>
<dd>My father and I were hiking along a trail near Lugano with this view — that experience will always be with me as a vivid memory.</dd>
</dl>
</div>
<p>At the end of it, I can say that I met people from dozens of different countries with very different cultures. I lived in 4 different countries across 2 continents. I learned how to be a good friend and learned which people I belong with. I learned what my place in this world is. I revived my joy in some activities from the past and found new activities that I enjoy. I made unique memories that will always be with me. All of these experiences and memories are at the very heart of the variability perspective. I could not have gotten any of this if I had chosen the program that offered more of the same, or if I had submitted to the attitude of &#8220;my master&#8217;s degree is just a piece of paper.&#8221; So with your choice of grad school, you also have the power to choose the range of experiences that will shape you into the beautiful person that you will become.</p>
<h3>Acknowledgements</h3>
<p>This blog post features contributions from Gabriel Ilharco. I would like to thank Hattie Zhou, Nelson Liu, Noah Smith, Gabriel Ilharco, Mitchell Wortsman, Luke Zettlemoyer, Aditya Kusupati, Jungo Kasai, and Ofir Press for their valuable feedback on drafts of this blog post.</p>
<h3>Update History</h3>
<p>2022-03: Added sections on Research Style and Computational Resources.</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2022/03/13/how-to-choose-your-grad-school/">How to Choose Your Grad School</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://timdettmers.com/2022/03/13/how-to-choose-your-grad-school/feed/</wfw:commentRss>
			<slash:comments>18</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">805</post-id>	</item>
		<item>
		<title>On Creativity in Academia</title>
		<link>https://timdettmers.com/2019/09/03/creativity-in-academia/</link>
					<comments>https://timdettmers.com/2019/09/03/creativity-in-academia/#comments</comments>
		
		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Tue, 03 Sep 2019 12:05:19 +0000</pubDate>
				<category><![CDATA[Academia]]></category>
		<category><![CDATA[PhD Life]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[PhD]]></category>
		<guid isPermaLink="false">https://timdettmers.com/?p=796</guid>

					<description><![CDATA[<p>I recently had a discussion about creativity with a colleague. We were discussing music and how creative many bands and groups are. At the end of our conversation, my friend told me, half-sarcastic-half-serious, how much more creative the people in the music industry are than him and that he just cannot find good ideas in [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2019/09/03/creativity-in-academia/">On Creativity in Academia</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></description>
<content:encoded><![CDATA[<p>I recently had a discussion about creativity with a colleague. We were discussing music and how creative many bands and groups are. At the end of our conversation, my friend told me, half sarcastic, half serious, how much more creative the people in the music industry are than him and that he just cannot find good ideas in his area of research even though he has tried so hard for such a long time. I was a bit surprised because I thought of him as someone very creative. However, it is not uncommon to hear scientists lament their lack of creativity compared to academic superstars. I think the common view of creativity in academia is a bit distorted, and a clearer view can help you feel less bad about your own creativity.</p>
<p><span id="more-796"></span></p>
<p>This blog post is part of a series of blog posts about scientific thinking in deep learning, natural language processing and science in general. I am currently on vacation in China, and I wanted to relax a bit by writing down some reflective blog posts which capture the thoughts that were lingering in my mind for weeks or months.</p>
<h2>Are Theoretical Physicists Creative?</h2>
<p>I think the paradox that very creative people think they are not creative is best demonstrated by looking at theoretical physicists and comparing them to children. In psychological research, it is well known that children score much better than adults on many tasks of divergent thinking: they do not think about the limitations of an object, so a brick which is used for constructing buildings is suddenly a tool for weight training, or a door stopper, or a paperweight, and so forth. If you ask people to build towers out of spaghetti and marshmallows, children do better than adults because they are not limited by what they think the structure of a tower should look like. But all of this is mere idea generation. Is this really creativity?</p>
<p>There is another famous case of similar creativity among physicists which might shine a light on where the boundary between idea generation and creativity in research lies: the undergrad theory of everything. It is a common problem for academics in physics to be tortured by undergrads who have just invented &#8220;a new theory of physics which can unify gravity and quantum mechanics&#8221;. The problem here is that the undergrads do not yet have the knowledge of the intricate relationships among equations to understand what is permissible and what is not. They see the brick as a door stopper when, in fact, a brick is used for building buildings and paving walkways. An important part of creativity is understanding which ideas are bad — some physics undergrads think it is just about idea generation. Do not get me wrong, idea generation is important, but it is not the most important part of creativity in academia.</p>
<p>This can go to an extreme if you work in theoretical physics or other fields where ideas are severely constrained by proper thought. There are so many bad ideas and so few good ideas that nobody really comes up with anything good anymore. However, it would be ludicrous to say that people like Edward Witten are not creative because he has not come up with any good ideas since string theory. Similarly, Albert Einstein labored for decades trying to unify gravity and quantum mechanics, only to come up with nothing. Bertrand Russell would often take a sheet of paper in the morning, work on a logical problem, and write down whatever useful thought he found. Most often, the paper was still blank in the evening. So if you see creativity as idea generation, Albert Einstein and others should be seen as failures compared to the children who churn out ideas. This demonstrates that the view of creativity as idea generation is problematic.</p>
<p>One thing that has to be understood when thinking about creativity is that some fields of thought are highly constrained in terms of which ideas are valid. To come by a good idea is a very lengthy and labor-intensive process. Other fields, like music, are very free in their expression and you can take any two ideas which do not seem to be related at all, mash them together, and with a little bit of work you can make it sound nice. I am exaggerating, but you get the idea.</p>
<p>Some fields, like machine translation, are now more and more constrained, and good ideas need a team of people equipped with large computational resources that collaborate effectively for a long time to come up with, and verify, an idea which will yield a tiny improvement. One can expect the constraints on ideas to increase exponentially with time in any given sub-field — just as they did in experimental physics. However, while finding valid ideas is becoming exponentially more difficult, these fields also spawn new sub-fields as offspring. In these new fields, it will be very easy to come up with new ideas since — similarly to the music industry — almost anything is valid. As a field progresses, the idea space becomes more and more constrained, and finding valid ideas is much more important than generating just any idea. If you work in an area which is very constrained, you should have more compassion for yourself. Creativity is not just about generating some imaginative ideas — it is more about finding strange ideas which are still valid.</p>
<h2>&#8220;Not Coming Up with Good Ideas&#8221; is Essential for Creativity</h2>
<p>Expertise is important and a requirement for creativity: you need to be able to understand which ideas are valid and which are not. The next step is to loosen up the boundaries between ideas that may seem unconnected at first glance. Psychological research says that once one has one of these strange ideas, it is important to hammer on it over and over to exhaustion. The idea will reshape itself from one form to the next, and eventually you will probably fail to come up with something reasonable that works. Science says that this is normal and that further insights are made unconsciously. After you give up on an idea, your unconscious mind is still piecing together the puzzle, and you might arrive at something useful over time. With the next puzzle piece put into place by your unconscious mind, you might be able to make progress on an idea, which might lead to a working, valid idea.</p>
<p>Many researchers fail in the creative process because they do not understand it well. They feel like failures when their ideas fail. But the process of hammering on ideas and not making any progress is the first part of creativity. Only if you know all the ways that do not work can you come up with the solutions that nobody else is seeing. The second step is often abandoning the idea for some time. Some researchers feel that abandoning an idea that did not work out is itself a failure and a sign of lacking creativity. But this step can be a critical element of creativity. It is important to have phases in which you do not think about an idea so that your unconscious mind can make the connections that your conscious mind cannot see. The next step is to pick up the failed idea and try again. The unconscious insights are revealed in this way, and you might quickly find a way to get the idea to work.</p>
<p>Another problem with the creative process is that researchers often work on a single idea. Instead, it is much more effective to work on many ideas: one idea to work on actively, while the other ideas sit in the back of your mind and provide enough material for your unconscious mind to churn on. These ideas do not need to be totally different from each other, just different enough not to bother your conscious mind while you work on another idea.</p>
<p>I think to have a sane creative process, it is essential to acknowledge and even embrace this long-winded exhausting struggle with multiple rounds of failure as an essential part of creativity.</p>
<h2>Conclusion</h2>
<p>Researchers are often very harsh critics of themselves when it comes to creativity: they do not come up with good ideas, or they come up with too few ideas, or their ideas do not work out. But this does not mean that you are not creative. Some fields of research are very constrained in which ideas are valid, and it is expected that the raw quantity of ideas in these fields is low. Furthermore, making no progress and abandoning an idea to work on something else are essential parts of creativity and should be celebrated and embraced. The next time you fail to make progress and think about abandoning an idea, give yourself a pat on the back — you just reached the first milestone on the way to a great idea!</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2019/09/03/creativity-in-academia/">On Creativity in Academia</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://timdettmers.com/2019/09/03/creativity-in-academia/feed/</wfw:commentRss>
			<slash:comments>5</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">796</post-id>	</item>
		<item>
		<title>Sparse Networks from Scratch: Faster Training without Losing Performance</title>
		<link>https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/</link>
					<comments>https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/#comments</comments>
		
		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Thu, 11 Jul 2019 13:07:26 +0000</pubDate>
				<category><![CDATA[Deep Learning]]></category>
		<category><![CDATA[Sparse Training]]></category>
		<guid isPermaLink="false">https://timdettmers.com/?p=774</guid>

					<description><![CDATA[<p>This blog post is about my work with Luke Zettlemoyer on fast training of neural networks which we keep sparse throughout training. We show that by developing an algorithm, sparse momentum, we can initialize a neural network with sparse random weights and train it to dense performance levels — all while doing just a single training run. Furthermore, if we use optimized sparse convolution algorithms, we can speed up training from 3.5x for VGG to 12x for Wide Residual Networks. This stands in stark contrast to computationally expensive methods which require repetitive prune-and-retrain cycles, as used by the Lottery Ticket Hypothesis (Frankle and Carbin, 2019) and other work. Thus we show that training sparse networks to dense performance levels does not require "winning the initialization lottery" but can be done reliably from random weights if combined with a method that moves weights around the network in a smart way. We call the paradigm that maintains sparsity throughout training while maintaining dense performance levels sparse learning. While this work shows that sparse learning is possible, future work holds the promise of training larger and deeper networks on more data while requiring the same or fewer computational resources than current dense networks.</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/">Sparse Networks from Scratch: Faster Training without Losing Performance</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>This blog post is about my work, <a href="https://arxiv.org/abs/1907.04840">Sparse Networks from Scratch: Faster Training without Losing Performance</a>, with <a href="https://www.cs.washington.edu/people/faculty/lsz">Luke Zettlemoyer</a>&nbsp;on fast training of neural networks which we keep sparse throughout training. We show that by developing an algorithm, sparse momentum, we can initialize a neural network with sparse random weights and train it to dense performance levels — all while doing just a single training run. Furthermore, if we use optimized sparse convolution algorithms, we can speed up training from 3.5x for VGG to 12x for Wide Residual Networks. This stands in stark contrast to computationally expensive methods which require repetitive prune-and-retrain cycles as used by the Lottery Ticket Hypothesis (Frankle and Carbin, 2019) and other work. Thus we show that training sparse networks to dense performance levels does not require &#8220;winning the initialization lottery&#8221; but can be done reliably from random weights if combined with a method that moves weights around the network in a smart way. We call the paradigm that maintains sparsity throughout training while maintaining dense performance levels <em>sparse learning</em>. While this work shows that sparse learning is possible, future work holds the promise of training larger and deeper networks on more data while requiring the same or fewer computational resources than current dense networks.</p>
<p><span id="more-774"></span></p>
<h2>Why Sparse Learning?</h2>
<p>A significant driver of progress in deep learning has been advances in computational resources. From 2010 to 2018, we saw an increase of 9700% in computational GPU performance. However, we can expect an increase of just a little more than 80% in GPU performance over the next 5-8 years due to reaching the physical limits of semiconductor technology. What does a research world look like in which we cannot make further improvements in computational power?</p>
<p>A glimpse of this comes from the natural language processing (NLP) community, where pretrained language models like ELMO, GPT, BERT, GPT-2, Grover, and XL-Net dominate the entire field by outperforming other methods on most NLP tasks. These models are often rather simple: You train them on lots of documents, and the task is mainly to predict a word given a sequence of other words&nbsp;— a bit like doing a fill-in-the-blank puzzle. The catch? These models are so big that they take well in excess of 100 GPU hours to train. This is particularly frustrating for academic researchers who want to understand these models but are unable to do so because they lack the computational resources that big companies have. To truly understand these massive pretrained language models, a primary goal should be to democratize their training by developing more resource-efficient training procedures.</p>
<p>One way to achieve this is to look at the human brain for inspiration. The human brain consumes 1/10th of the energy of a GPU but is 10^9 times more powerful. What makes the brain so computationally efficient? <a href="https://timdettmers.com/2015/07/27/brain-vs-deep-learning-singularity/">There are many reasons</a>, but one reason is<em> sparsity</em>.</p>
<p>It has been found that the more neurons a primate brain has, the fewer connections the average neuron makes with other neurons (Herculano-Houzel et al., 2010). This is very much contrary to how we design deep neural networks, where we connect every new neuron in a layer to all neurons in the previous layer. We already understand how to compress a fully trained dense network into a sparse network (Han et al., 2015), but there has been little work on how to do this successfully if one starts from a sparse network which is kept sparse during training. How do we do this?</p>
<h2>Sparse Momentum: An Efficient Way to Train Sparse Networks</h2>
<p>This section explains the sparse momentum algorithm, from intuition up to the full algorithm.</p>
<p><figure id="attachment_779" aria-describedby="caption-attachment-779" style="width: 1096px" class="wp-caption aligncenter"><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?ssl=1"><img data-attachment-id="779" data-permalink="https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/sparse_momentum/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?fit=1096%2C528&amp;ssl=1" data-orig-size="1096,528" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Sparse Momentum Dettmers &#038; Zettlemoyer 2019" data-image-description="&lt;p&gt;Figure X: The sparse training algorithm developed has three stages: (1) Determine the importance of each layer. (2) Remove the smallest, unimportant weights. (3) Grow new weights proportional to the importance of each layers.&lt;/p&gt;
" data-image-caption="&lt;p&gt;Figure X: The sparse training algorithm developed has three stages: (1) Determine the importance of each layer. (2) Remove the smallest, unimportant weights. (3) Grow new weights proportional to the importance of each layers.&lt;/p&gt;
" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?fit=300%2C145&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?fit=1024%2C493&amp;ssl=1" class="wp-image-779 size-full" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=1096%2C528&#038;ssl=1" alt="Sparse Momentum determines where to grow new weights in a sparse network by looking at the weighted average of recent gradients (momentum) to find weights and layers which reduce the error consistently. (1) We determine the importance of each layer according to the mean momentum magnitude. (2) For each layer, we remove the 50\% of the smallest weights. (3) We then redistribute the weights across layers according to layer importance. Within a layer we grow weights where the momentum magnitude is large." width="1096" height="528" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?w=1096&amp;ssl=1 1096w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=300%2C145&amp;ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=768%2C370&amp;ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=1024%2C493&amp;ssl=1 1024w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" /></a><figcaption id="caption-attachment-779" class="wp-caption-text">Figure 1: Sparse Momentum determines where to grow new weights in a sparse network by looking at the weighted average of recent gradients (momentum) to find weights and layers which reduce the error consistently. (1) We determine the importance of each layer according to the mean momentum magnitude. (2) For each layer, we remove the 50% of the smallest weights. (3) We then redistribute the weights across layers according to layer importance. 
Within a layer we grow weights where the momentum magnitude is large.</figcaption></figure></p>
<h3>What is the Main Quality of Good Sparse Learning Algorithms?</h3>
<p>In sparse learning, the most important thing is to use every single weight in a neural network as effectively as possible. If you define &#8220;effectiveness&#8221; as &#8220;reducing the error,&#8221; then we have an obvious perspective on how we can proceed. We need to find a measure which describes how effective a weight is at reducing the error and remove all weights which do not. Once we removed weights, we want to regrow new weights in locations which we think are promising at reducing the error in the future.</p>
<p>If we look at the gradient of the error with respect to the weight, we actually have precisely such a measure. However, if we look at successive gradients, we find that gradients can wildly oscillate. For example if you have a neural network which classifies handwritten digits 0 to 9 then a weight might be good at detecting a straight line at the top and it might help to reduce the error for the numbers 5, 7 but then it might not help or even be detrimental for numbers 0, 1, 2, 3, 6, 8, 9. Instead, a weight which detects a curvy pattern in the top right might help for 0, 2, 3, 8, 9 and as such we would expect that this weight reduces the error more consistently over time than the &#8220;straight line at the top&#8221; weight. How can we detect such promising weights in a neural network automatically?</p>
<h3>Momentum: Finding Weights that Reduce the Error Consistently</h3>
<p>If you take the north pole to be a local minimum and a compass needle the gradient towards the local minimum, then you can simulate stochastic gradient descent updates by shaking the compass wildly to spin the compass needle. With every time the needle passes the north pole it will slow down and line-up more and more with the north pole, however, due to the spin it will still &#8220;overshoot&#8221; that direction. So it might be unclear where the north pole is from two or three measurements while the needle is still moving back and forth. However, if you take the average directions&nbsp;— one time the needle is a bit to the left of the north pole, another time it is more to the right&nbsp;— then these deviations cancel out, and you will immediately get a direction which is very close to the real north pole.</p>
<p>This is the main idea behind the momentum optimization technique: We average successive gradients to get a better estimate of the direction of the local minimum. Similarly to the compass needle, which gets more and more accurate over time as it slows down, we want to weight more recent gradient directions in stochastic gradient descent more highly. One way to do this is to assign a weighted average where we assign a much larger weight to the current gradient and a small weight to the previous gradients&nbsp;— this is called exponential smoothing. Through exponential smoothing the gradients of the weight we receive a weighted gradient matrix — this matrix is the momentum matrix which gives momentum optimization its name. With this measure, we can identify which are the weights which reduce the error consistently.</p>
<h3>Redistributing Weights: The Mean Momentum Magnitude of a Layer</h3>
<p>From here we make the first important observation for our sparse momentum algorithm: If the momentum of a weight indicates how much it reduces the error consistently, then the mean momentum magnitude of all the weights in a layer should indicate how much each layer is reducing the error on average. We take the magnitude because two different weights might consistently go into a negative direction and a positive direction. By taking the mean momentum magnitude of layers, we can easily compare how effective the average weight in each layer is. This enables to say, for example, that a weight in a convolutional layer A is on average 1/3 as effective at reducing the error as the average weight in fully connected layer B, or vice versa. This method enables us to redistribute weights effectively: if we find &#8220;useless&#8221; weights, we now know precisely in which layer to put it&nbsp;— but where to put them exactly within a layer?</p>
<h3>Which Weights Should be Removed? Where to Regrow them?</h3>
<p>The next two problems are more straightforward: Which are the most useless weights? Where do we grow weights within a layer? The first problem is a common problem in neural network compression research, where one often prunes the weights with the smallest magnitude. Why does this make sense? If we assume all weights receive on average inputs of similar magnitude&nbsp;— a reasonable assumption if one uses batch normalization&nbsp;— then weights with small magnitudes make the smallest difference in the activation of a neuron. As such, removing them should change the predictive performance of our networks by the smallest amount.</p>
<p>Once we removed weights and redistributed them to weight-effective layers as measured by the mean momentum magnitude of a layer, we need to decide where exactly to grow them within a layer. One possible solution becomes apparent if we ask: &#8220;Which two unconnected neurons would reduce the error consistently if we connect them?&#8221; The answer to this question would again point to the momentum magnitude. This time, however, we want to look at the momentum magnitude of &#8220;missing&#8221; or zero-valued weights, that is, we want to look at those weights which have been excluded from training before. Thus we grow weights in locations where missing weights have the largest momentum magnitude. This completes the sparse momentum algorithm, which depicted in Figure 1.</p>
<h2>Results</h2>
<p>The results are quite impressive! We compared against compression algorithms on MNIST, where sparse momentum outperforms most other methods. This is a pretty good result given that compression methods start from a dense network and usually retrain repetitively while we train a sparse network from scratch! Another impressive result is that we can match or even exceed the performance of dense networks by using 20% of weights (80% sparsity). On CIFAR-10, we compare against Single-shot Network Pruning which is designed for simplicity and not performance&nbsp;— so it is not surprising that sparse momentum does better. However, what is interesting is that we can train both VGG16-D (a version of VGG16 with two fully connected layers) and Wide Residual Network (WRN) 16-10 (16 layers deep and very wide WRN) to dense performance levels with just 5% of weights. For other networks, sparse momentum comes close to dense performance levels. Furthermore, as I will show later, with an optimized sparse convolution algorithm, we would be able to train a variety of networks to yield the same performance levels while training between 3.0-5.6x faster!</p>
<p>&nbsp;</p>
<p><figure id="attachment_800" aria-describedby="caption-attachment-800" style="width: 1830px" class="wp-caption aligncenter"><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum_vs_compression.png?ssl=1"><img data-attachment-id="800" data-permalink="https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/sparse_momentum_vs_compression/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum_vs_compression.png?fit=1830%2C697&amp;ssl=1" data-orig-size="1830,697" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="sparse_momentum_vs_compression" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum_vs_compression.png?fit=300%2C114&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum_vs_compression.png?fit=1024%2C390&amp;ssl=1" class="wp-image-800 size-full" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum_vs_compression.png?resize=1830%2C697&#038;ssl=1" alt="Sparse Momentum results compared to neural network compression methods on MNIST for LeNet-300-100 and LeNet-5 Caffe." 
width="1830" height="697" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum_vs_compression.png?w=1830&amp;ssl=1 1830w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum_vs_compression.png?resize=300%2C114&amp;ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum_vs_compression.png?resize=768%2C293&amp;ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum_vs_compression.png?resize=1024%2C390&amp;ssl=1 1024w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" /></a><figcaption id="caption-attachment-800" class="wp-caption-text">Sparse Momentum results compared to neural network compression methods on MNIST for LeNet-300-100 and LeNet-5 Caffe.</figcaption></figure></p>
<p>&nbsp;</p>
<p><figure id="attachment_801" aria-describedby="caption-attachment-801" style="width: 1024px" class="wp-caption aligncenter"><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/imagenet_results.png?ssl=1"><img data-attachment-id="801" data-permalink="https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/imagenet_results/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/imagenet_results.png?fit=1062%2C388&amp;ssl=1" data-orig-size="1062,388" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="imagenet_results" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/imagenet_results.png?fit=300%2C110&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/imagenet_results.png?fit=1024%2C374&amp;ssl=1" class="wp-image-801 size-large" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/imagenet_results.png?resize=1024%2C374&#038;ssl=1" alt="ImageNet results for Sparse momentum and related methods. For the models that are not fully sparse, the first convolution and all downsample residual connections are dense from the start of training. In the fully sparse setting, all layers are sparse. Sparse momentum works better than other methods and works almost equally well if all the weights are sparse. This indicates that sparse momentum is efficient at finding important layers which require a high density." 
width="1024" height="374" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/imagenet_results.png?resize=1024%2C374&amp;ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/imagenet_results.png?resize=300%2C110&amp;ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/imagenet_results.png?resize=768%2C281&amp;ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/imagenet_results.png?w=1062&amp;ssl=1 1062w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" /></a><figcaption id="caption-attachment-801" class="wp-caption-text">ImageNet results for Sparse momentum and related methods. For the models that are not fully sparse, the first convolution and all downsample residual connections are dense from the start of training. In the fully sparse setting, all layers are sparse. Sparse momentum works better than other methods and works almost equally well if all the weights are sparse. This indicates that sparse momentum is efficient at finding important layers which require a high density.</figcaption></figure></p>
<p>On ImageNet, we are not able to reach dense performance levels, which indicates that there is room to improve sparse momentum. However, we can demonstrate that sparse momentum has a clear lead compared to other methods that maintain sparse weights throughout training.</p>
<h3>Speedups</h3>
<p>The main promise of sparse learning was to accelerate training&nbsp;— were we successful? Yes&nbsp;— and no. Sparse momentum accelerates training efficiently if we measure possible speedups for sparse convolution, but since sparse networks were only very recently used for training, no optimized sparse convolution algorithms exist for the GPU&nbsp;— at least not for the fine-grained sparse patterns of weights as exhibited by sparse momentum.</p>
<p>As such, we divide the speedups into two groups: Possible speedups which could be achieved if sparse convolution algorithms would exist, and speedups which we can achieve today with standard dense convolutional algorithms. How can dense convolutions help for sparse networks?</p>
<p>If we look at the sparsity pattern of our network, we have the case where a convolutional channel is entirely empty&nbsp;— a convolutional filter full of zeros! If this happens, we can remove the channel from the computation without changing the results of the convolution and thus gain speedups.</p>
<p>&nbsp;</p>
<p><figure id="attachment_799" aria-describedby="caption-attachment-799" style="width: 1024px" class="wp-caption aligncenter"><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/dense_equivalents.png?ssl=1"><img data-attachment-id="799" data-permalink="https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/dense_equivalents/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/dense_equivalents.png?fit=1096%2C468&amp;ssl=1" data-orig-size="1096,468" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="dense_equivalents" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/dense_equivalents.png?fit=300%2C128&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/dense_equivalents.png?fit=1024%2C437&amp;ssl=1" class="wp-image-799 size-large" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/dense_equivalents.png?resize=1024%2C437&#038;ssl=1" alt="" width="1024" height="437" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/dense_equivalents.png?resize=1024%2C437&amp;ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/dense_equivalents.png?resize=300%2C128&amp;ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/dense_equivalents.png?resize=768%2C328&amp;ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/dense_equivalents.png?w=1096&amp;ssl=1 1096w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" 
/></a><figcaption id="caption-attachment-799" class="wp-caption-text">Sparse momentum can replicate dense performance levels for a range of networks with a fraction of the weights thus leading to speedups.</figcaption></figure></p>
<p>However, if we look at the speedups, we see there is a marked difference between sparse convolution and dense convolution speedups. This clearly shows the need for optimized sparse convolution algorithms for the GPU.</p>
<h2>Why does Sparse Learning Work?</h2>
<p>Some of our sparse networks trained with sparse momentum matched the performance levels of dense networks with just 5% of weights. What makes these 5% of weights so efficient that they can match a neural network with 20 times as many weights?</p>
<p>To look into this question, we looked at how the features of sparse networks compare to dense networks. Low-level features might include things like edge detectors. Mid-level features might be things like wheels, noses, eyes, paws. High-level features might be the &#8220;face&#8221; of a car, a cat face, a fridge door, and so forth.</p>
<p>To reduce features to numbers we look at convolutional channels&nbsp;— the equivalent to a &#8220;neuron&#8221; in a convolutional network&nbsp;— and how useful the channel is to classes in the dataset. Edge detectors should be useful to almost all classes in the dataset&nbsp;— in other words, they should have a low level of class-specialization. Mid-level features like eyes should be useful to—some classes such as cats, dogs, and humans. High-level features should be useful to a few selected classes&nbsp;— they are highly class-specialized.</p>
<p><figure id="attachment_778" aria-describedby="caption-attachment-778" style="width: 1016px" class="wp-caption aligncenter"><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_features.png?ssl=1"><img data-attachment-id="778" data-permalink="https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/sparse_features/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_features.png?fit=1016%2C1098&amp;ssl=1" data-orig-size="1016,1098" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="sparse_features" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_features.png?fit=278%2C300&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_features.png?fit=948%2C1024&amp;ssl=1" class="wp-image-778 size-full" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_features.png?resize=1016%2C1098&#038;ssl=1" alt="Figure 6: Class-specialization histograms for sparse and dense networks for AlexNet, VGG16 and WRN 28-2." 
width="1016" height="1098" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_features.png?w=1016&amp;ssl=1 1016w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_features.png?resize=278%2C300&amp;ssl=1 278w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_features.png?resize=768%2C830&amp;ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_features.png?resize=948%2C1024&amp;ssl=1 948w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" /></a><figcaption id="caption-attachment-778" class="wp-caption-text">Figure 6: Class-specialization histograms for sparse and dense networks for AlexNet, VGG16 and WRN 28-2.</figcaption></figure></p>
<p>What we find is that on average, sparse networks learn features which are useful to a broader range of classes&nbsp;— they learn more general features. This might be a possible explanation of why sparse networks can match the performance of dense networks with as few as 5% weights.</p>
<h2>The Future of Sparse Learning</h2>
<p>I believe sparse learning has a very bright future because (1) GPUs will stagnate in performance over the next years, (2) specialized processors for sparse workloads, Graphcore processors, are around the corner. Graphcore processors store an entire network in its 300 MB cache and accelerate it by a factor of roughly 100x. This means, if we can compress a network to 300 MB during training, then we will have 100x faster training overall. Training a ResNet-50 on ImageNet would then take only roughly 15 minutes using one Graphcore processor. With sparse learning, the 300 MB limit will be in reach without a problem.</p>
<p>My prediction is that the first research team that can train a sparse neural network on a Graphcore processor successfully will unlock an entirely new level of artificial intelligence.</p>
<p>Besides this, another challenge is to apply sparse learning algorithms to natural language processing (NLP). Unsurprisingly, my experimentation on transformers for natural language processing tasks show that sparse learning is much more difficult in NLP compared to computer vision&nbsp;— lots of work to do!</p>
<h2>Try Sparse Momentum with Your Own Model in 10 Lines of Code!</h2>
<p><figure id="attachment_781" aria-describedby="caption-attachment-781" style="width: 766px" class="wp-caption aligncenter"><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/code.png?ssl=1"><img data-attachment-id="781" data-permalink="https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/code/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/code.png?fit=766%2C533&amp;ssl=1" data-orig-size="766,533" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="code" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/code.png?fit=300%2C209&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/code.png?fit=766%2C533&amp;ssl=1" class="wp-image-781 size-full" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/code.png?resize=766%2C533&#038;ssl=1" alt="Figure 7: Example of a generic sparse learning script which you can use for your own model. With my sparselearning library it is easy to use sparse momentum: (1) Import the library, (2) add the parser options, (3) wrap your model with the Masking class, (4) apply mask instead of optimizer, (5) apply sparse momentum at the end of epoch. The library is also easily extendable with your own sparse learning algorithms for growth, pruning, or redistribution -- all it takes is a few lines of code!" 
width="766" height="533" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/code.png?w=766&amp;ssl=1 766w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/code.png?resize=300%2C209&amp;ssl=1 300w" sizes="(max-width: 766px) 100vw, 766px" data-recalc-dims="1" /></a><figcaption id="caption-attachment-781" class="wp-caption-text">Figure 7: Example of a generic sparse learning script which you can use for your own model. With my sparselearning library it is easy to use sparse momentum: (1) Import the library, (2) add the parser options, (3) wrap your model with the Masking class, (4) apply mask instead of optimizer, (5) apply sparse momentum at the end of epoch. The library is also easily extendable with your own sparse learning algorithms for growth, pruning, or redistribution&nbsp;— all it takes is a few lines of code!</figcaption></figure></p>
<p>To make sparse learning accessible to everyone I developed a sparse learning library which allows the easy application of existing algorithms like sparse momentum to your own models&nbsp;— it can be done in less than 10 lines of code. The library is also designed to make it very easy to add your own sparse learning methods. You find my <a href="https://github.com/TimDettmers/sparse_learning">sparse learning library</a> on GitHub.</p>
<h3>Questions?</h3>
<p>For questions, I prefer if you post them below if they are simple and straightforward. If you have a more formal question regarding our work that requires careful answers, you can post an the question as <a href="https://github.com/TimDettmers/sparse_learning/issues">a GitHub issue</a>&nbsp;— I will try to answer as timely as possible.</p>
<h4>Acknowledgements</h4>
<p>I thank Luke Zettlemoyer for feedback on an early draft of this blog post.</p>
<h3>References</h3>
<p>Frankle, J. and Carbin, M. (2019). The lottery ticket hypothesis: Finding sparse, trainable neural networks. In <em>ICLR 2019</em>.</p>
<p>Han, S., Pool, J., Tran, J., and Dally, W. (2015). Learning both weights and connections for efficient neural network. In <em>Advances in neural information processing systems</em>, pages<br />
1135—1143.</p>
<p>Herculano-Houzel, S., Mota, B., Wong, P., and Kaas, J.H. (2010). Connectivity-driven white matter scaling and folding in primate cerebral cortex. In&nbsp;<em>Proceedings of the National Academy of Sciences of the United States of America</em>, 107 44:19008—13.</p>
<p>&nbsp;</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/">Sparse Networks from Scratch: Faster Training without Losing Performance</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/feed/</wfw:commentRss>
			<slash:comments>38</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">774</post-id>	</item>
		<item>
		<title>A Full Hardware Guide to Deep Learning</title>
		<link>https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/</link>
					<comments>https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/#comments</comments>
		
		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Sun, 16 Dec 2018 18:25:41 +0000</pubDate>
				<category><![CDATA[Hardware]]></category>
		<category><![CDATA[AMD]]></category>
		<category><![CDATA[CPU]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Intel]]></category>
		<category><![CDATA[PCIe Lanes]]></category>
		<guid isPermaLink="false">https://timdettmers.wordpress.com/?p=121</guid>

					<description><![CDATA[<p>Here I will guide you step by step through the hardware you will need for a cheap high performance system for deep learning.</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/">A Full Hardware Guide to Deep Learning</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="eplus-Jtk1uQ">Deep Learning is very computationally intensive, so you will need a fast CPU with many cores, right? Or is it maybe wasteful to buy a fast CPU? One of the worst things you can do when building a deep learning system is to waste money on hardware that is unnecessary. Here I will guide you step by step through the hardware you will need for a cheap high-performance system.</p>



<span id="more-121"></span>



<p class="eplus-UTx6Bt">Over the years, I build a total of 7 different deep learning workstations and despite careful research and reasoning, I made my fair share of mistake in selecting hardware parts. In this guide, I want to share my experience that I gained over the years so that you do not make the same mistakes that I did before.</p>



<p class="eplus-LmAMuN">The blog post is ordered by mistake severity. This means the mistakes where people usually waste the most money come first.</p>




<h2><strong>GPU</strong></h2>
<p>This blog post assumes that you will use a GPU for deep learning. If you are building or upgrading your system for deep learning, it is not sensible to leave out the GPU. The GPU is just the heart of deep learning applications – the improvement in processing speed is just too huge to ignore.</p>
<p>I talked at length about GPU choice in <a href="https://timdettmers.com/2020/09/07/which-gpu-for-deep-learning/">my GPU recommendations blog post</a>, and the choice of your GPU is probably the most critical choice for your deep learning system. There are three main mistakes that you can make when choosing a GPU: (1) bad cost/performance, (2) not enough memory, (3) poor cooling.</p>
<p>For good cost/performance, I generally recommend an RTX 2070 or an RTX 2080 Ti. If you use these cards you should use 16-bit models. Otherwise, GTX 1070, GTX 1080, GTX 1070 Ti, and GTX 1080 Ti from eBay are fair choices and you can use these GPUs with 32-bit (but not 16-bit).</p>
<p>Be careful about the memory requirements when you pick your GPU. RTX cards, which can run in 16-bits, can train models which are twice as big with the same memory compared to GTX cards. As such RTX cards have a memory advantage and picking RTX cards and learn how to use 16-bit models effectively will carry you a long way. In general, the requirements for memory are roughly the following:</p>
<ul>
<li>Research that is hunting state-of-the-art scores: &gt;=11 GB</li>
<li>Research that is hunting for interesting architectures: &gt;=8 GB</li>
<li>Any other research: 8 GB</li>
<li>Kaggle: 4 &#8211; 8 GB</li>
<li>Startups: 8 GB (but check the specific application area for model sizes)</li>
<li>Companies: 8 GB for prototyping, &gt;=11 GB for training</li>
</ul>
<p>Another problem to watch out for, especially if you buy multiple RTX cards is cooling. If you want to stick GPUs into PCIe slots which are next to each other you should make sure that you get GPUs with a blower-style fan. Otherwise you might run into temperature issues and your GPUs will be slower (about 30%) and die faster.</p>
<figure id="attachment_124" aria-describedby="caption-attachment-124" style="width: 700px" class="wp-caption aligncenter"><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/suspectlineup.jpg"><img data-attachment-id="124" data-permalink="https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/suspectlineup/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/suspectlineup.jpg?fit=3264%2C1836&amp;ssl=1" data-orig-size="3264,1836" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;2.4&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;U9200&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1403265762&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;4.13&quot;,&quot;iso&quot;:&quot;122&quot;,&quot;shutter_speed&quot;:&quot;0.016666&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;1&quot;}" data-image-title="suspectlineup" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/suspectlineup.jpg?fit=300%2C169&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/suspectlineup.jpg?fit=1024%2C576&amp;ssl=1" class="wp-image-124 size-full" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/suspectlineup.jpg?resize=700%2C394" alt="" width="700" height="394" border="None" data-recalc-dims="1" /></a><figcaption id="caption-attachment-124" class="wp-caption-text"><strong>Suspect line-up</strong><br />Can you identify the hardware part which is at fault for bad performance? One of these GPUs? Or maybe it is the fault of the CPU after all?</figcaption></figure>
<h2>RAM</h2>
<p>The main mistakes with RAM is to buy RAM with a too high clock rate. The second mistake is to buy not enough RAM to have a smooth prototyping experience.</p>
<h3>Needed RAM Clock Rate</h3>
<p>RAM clock rates are marketing stunts where RAM companies lure you into buying &#8220;faster&#8221; RAM which actually yields little to no performance gains. This is best explained by the &#8220;<a href="https://www.youtube.com/watch?v=D_Yt4vSZKVk">Does RAM speed REALLY matter?</a>&#8221; video by Linus Tech Tips.</p>
<p>Furthermore, it is important to know that RAM speed is pretty much irrelevant for fast CPU RAM-&gt;GPU RAM transfers. This is so because (1) if you use <a href="https://pytorch.org/docs/stable/notes/cuda.html#use-pinned-memory-buffers">pinned memory</a>, your mini-batches will be transferred to the GPU without involvement from the CPU, and (2) if you do not use pinned memory, the performance gains of fast vs slow RAM are about 0-3% &#8212; spend your money elsewhere!</p>
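<p>As a concrete sketch of point (1), assuming PyTorch (the dataset below is synthetic stand-in data, not a real loader setup): pinned memory is requested with the DataLoader's <code>pin_memory</code> flag, and the asynchronous copy with <code>non_blocking=True</code>:</p>

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a real image dataset (illustration only).
data = torch.randn(256, 3, 224, 224)
labels = torch.randint(0, 10, (256,))
dataset = TensorDataset(data, labels)

# pin_memory=True puts each batch in page-locked host memory,
# which is what enables asynchronous CPU->GPU copies.
loader = DataLoader(dataset, batch_size=32, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
for x, y in loader:
    # non_blocking=True lets the transfer overlap with GPU compute,
    # provided the source tensor is pinned.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    # ... forward/backward pass would go here ...
```

<p>With this pattern, the GPU's copy engine handles the transfer, so regular CPU RAM speed sits outside the critical path.</p>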
<h3>RAM Size</h3>
<p>RAM size does not affect deep learning performance. However, it might hinder you from executing your GPU code comfortably (without swapping to disk). You should have enough RAM to work comfortably with your GPU. This means you should have at least the amount of RAM that matches your biggest GPU. For example, if you have a Titan RTX with 24 GB of memory you should have at least 24 GB of RAM. However, if you have more GPUs you do not necessarily need more RAM.</p>
<p>The problem with this &#8220;match largest GPU memory in RAM&#8221; strategy is that you might still fall short of RAM if you are processing large datasets. The best strategy here is to match your GPU&#8217;s memory, and if you feel that you do not have enough RAM, just buy some more.</p>
<p>A different strategy is influenced by psychology: Psychology tells us that concentration is a resource that is depleted over time. RAM is one of the few hardware pieces that lets you conserve that resource for more difficult programming problems. Rather than spending lots of time circumnavigating RAM bottlenecks, you can invest your concentration in more pressing matters if you have more RAM. Especially in Kaggle competitions, I found additional RAM very useful for feature engineering. So if you have the money and do a lot of pre-processing, additional RAM might be a good choice. With this strategy, you want to have more, cheap RAM now rather than later.</p>
<h2>CPU</h2>
<p>The main mistake people make is paying too much attention to the PCIe lanes of a CPU. You should not care much about PCIe lanes. Instead, just look up whether your CPU and motherboard combination supports the number of GPUs that you want to run. The second most common mistake is getting a CPU that is more powerful than needed.</p>
<h3>CPU and PCI-Express</h3>
<p>People go crazy about PCIe lanes! However, the thing is that they have almost no effect on deep learning performance. If you have a single GPU, PCIe lanes are only needed to transfer data from your CPU RAM to your GPU RAM quickly. However, an ImageNet batch of 32 images (32x225x225x3) at 32-bit precision needs 1.1 milliseconds with 16 lanes, 2.3 milliseconds with 8 lanes, and 4.5 milliseconds with 4 lanes. These are theoretical numbers, and in practice you often see PCIe be twice as slow &#8212; but this is still lightning fast! PCIe lanes often have a latency in the nanosecond range and thus latency can be ignored.</p>
<p>Putting this together we have for an ImageNet mini-batch of 32 images and a ResNet-152 the following timing:</p>
<ul>
<li>Forward and backward pass: 216 milliseconds (ms)</li>
<li>16 PCIe lanes CPU-&gt;GPU transfer: About 2 ms (1.1 ms theoretical)</li>
<li>8 PCIe lanes CPU-&gt;GPU transfer: About 5 ms (2.3 ms)</li>
<li>4 PCIe lanes CPU-&gt;GPU transfer: About 9 ms (4.5 ms)</li>
</ul>
<p>Thus going from 4 to 16 PCIe lanes will give you a performance increase of roughly 3.2%. However, if you use <a href="https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader">PyTorch&#8217;s data loader</a> with pinned memory you gain exactly 0% performance. So do not waste your money on PCIe lanes if you are using a single GPU!</p>
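<p>The arithmetic behind these numbers can be checked with a few lines of Python (a rough sketch; the bandwidth figures are assumed usable PCIe 3.0 values, not measurements):</p>

```python
# Theoretical PCIe transfer time for one float32 ImageNet mini-batch
# of 32 images (32 x 225 x 225 x 3), using assumed usable bandwidths
# of ~15.75 GB/s for 16 lanes of PCIe 3.0, halved for 8 and 4 lanes.
batch_bytes = 32 * 225 * 225 * 3 * 4  # ~19.4 MB

for lanes, gb_s in [(16, 15.75), (8, 7.88), (4, 3.94)]:
    ms = batch_bytes / (gb_s * 1e9) * 1e3
    print(f"{lanes:2d} lanes: {ms:.1f} ms (theoretical)")

# Relative gain of 16 vs 4 lanes for a 216 ms forward+backward pass,
# using the measured ~2 ms and ~9 ms transfer times quoted above:
speedup_pct = ((216 + 9) / (216 + 2) - 1) * 100
print(f"4 -> 16 lanes speedup: {speedup_pct:.1f}%")  # ~3.2%
```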
<p>When you select CPU and motherboard PCIe lanes, make sure that you select a combination which supports the desired number of GPUs. If you buy a motherboard that supports 2 GPUs, and you want to have 2 GPUs eventually, make sure that you buy a CPU that supports 2 GPUs, but do not obsess over the number of PCIe lanes.</p>
<h3>PCIe Lanes and Multi-GPU Parallelism</h3>
<p>Are PCIe lanes important if you train networks on multiple GPUs with data parallelism? I have <a href="https://arxiv.org/abs/1511.04561">published a paper on this at ICLR2016</a>, and I can tell you if you have 96 GPUs then PCIe lanes are really important. However, if you have 4 or fewer GPUs this does not matter much. If you parallelize across 2-3 GPUs, I would not care at all about PCIe lanes. With 4 GPUs, I would make sure that I get 8 PCIe lanes per GPU (32 PCIe lanes in total). Since almost nobody runs a system with more than 4 GPUs, as a rule of thumb: Do not spend extra money to get more PCIe lanes per GPU &#8212; it does not matter!</p>
<h3>Needed CPU Cores</h3>
<p>To be able to make a wise choice for the CPU we first need to understand the CPU and how it relates to deep learning. What does the CPU do for deep learning? The CPU does little computation when you run your deep nets on a GPU. Mostly it (1) initiates GPU function calls, (2) executes CPU functions.</p>
<p>By far the most useful application for your CPU is data preprocessing. There are two different common data processing strategies which have different CPU needs.</p>
<p>The first strategy is preprocessing while you train:</p>
<p>Loop:</p>
<ol>
<li>Load mini-batch</li>
<li>Preprocess mini-batch</li>
<li>Train on mini-batch</li>
</ol>
<p>The second strategy is preprocessing before any training:</p>
<ol>
<li>Preprocess data</li>
<li>Loop:
<ol>
<li>Load preprocessed mini-batch</li>
<li>Train on mini-batch</li>
</ol>
</li>
</ol>
<p>For the first strategy, a good CPU with many cores can boost performance significantly. For the second strategy, you do not need a very good CPU. For the first strategy, I recommend a minimum of 4 threads per GPU &#8212; that is usually two cores per GPU. I have not done hard tests for this, but you should gain about 0-5% additional performance per additional core/GPU.</p>
<p>For the second strategy, I recommend a minimum of 2 threads per GPU &#8212; that is usually one core per GPU. You will not see significant gains in performance when you have more cores if you are using the second strategy.</p>
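<p>In PyTorch terms, the two strategies mainly differ in how much work happens inside the dataset's <code>__getitem__</code>, and hence how many DataLoader worker processes you need (a sketch with synthetic data; the normalization stands in for real preprocessing):</p>

```python
import torch
from torch.utils.data import DataLoader, Dataset, TensorDataset

class OnTheFlyDataset(Dataset):
    """Strategy 1: preprocessing runs inside __getitem__, so each
    DataLoader worker burns CPU while the GPU trains."""
    def __len__(self):
        return 128
    def __getitem__(self, i):
        x = torch.randn(3, 224, 224)           # stand-in for decoding a JPEG
        x = (x - x.mean()) / (x.std() + 1e-8)  # per-image normalization on the CPU
        return x, i % 10

# Strategy 1: several worker processes per GPU to hide preprocessing latency.
train_loader = DataLoader(OnTheFlyDataset(), batch_size=32, num_workers=4)

# Strategy 2: data was preprocessed once up front, so loading is cheap
# and zero (or one) workers per GPU suffice.
pre = torch.randn(128, 3, 224, 224)
cheap_loader = DataLoader(TensorDataset(pre), batch_size=32, num_workers=0)
```

<p>The worker counts here mirror the thread recommendations above: roughly two cores per GPU for on-the-fly preprocessing, one for preprocessed data.</p>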
<h3>Needed CPU Clock Rate (Frequency)</h3>
<p>When people think about fast CPUs they usually first think about the clock rate. 4 GHz is better than 3.5 GHz, or is it? This is generally true for comparing processors with the same architecture, e.g. &#8220;Ivy Bridge&#8221;, but it does not compare well across processors with different architectures. Also, it is not always the best measure of performance.</p>
<p>In the case of deep learning there is very little computation to be done by the CPU: Increase a few variables here, evaluate some Boolean expression there, make some function calls on the GPU or within the program – all these depend on the CPU core clock rate.</p>
<p>While this reasoning seems sensible, the CPU does show 100% usage when I run deep learning programs, so what is going on? I did some CPU core clock rate underclocking experiments to find out.</p>
<figure id="attachment_161" aria-describedby="caption-attachment-161" style="width: 804px" class="wp-caption aligncenter"><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/cpu_underclocking2.png"><img data-attachment-id="161" data-permalink="https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/cpu_underclocking/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/cpu_underclocking2.png?fit=603%2C406&amp;ssl=1" data-orig-size="603,406" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="cpu_underclocking" data-image-description="" data-image-caption="&lt;p&gt;CPU underclocking on MNIST and ImageNet: Performance is measured as time taken on 100 epochs MNIST or half an epoch on ImageNet with different CPU core clock rates, where the maximum clock rate is taken as a base line for each CPU. For comparison: Upgrading from a GTX 580 to a GTX Titan is about +20% performance; from GTX Titan to GTX 980 another +30% performance; GPU overclocking yields about +5% performance for any GPU&lt;/p&gt;
" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/cpu_underclocking2.png?fit=300%2C202&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/cpu_underclocking2.png?fit=603%2C406&amp;ssl=1" class="wp-image-161" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/cpu_underclocking2.png?resize=804%2C541" alt="CPU underclocking on MNIST and ImageNet: Performance is measured as time taken on 100 epochs MNIST or half an epoch on ImageNet with different CPU core clock rates, where the maximum clock rate is taken as a base line for each CPU. For comparison: Upgrading from a GTX 580 to a GTX Titan is about +20% performance; from GTX Titan to GTX 980 another +30% performance; GPU overclocking yields about +5% performance for any GPU" width="804" height="541" data-recalc-dims="1" /></a><figcaption id="caption-attachment-161" class="wp-caption-text"><strong>CPU underclocking on MNIST and ImageNet</strong>: Performance is measured as time taken on 200 epochs MNIST or a quarter epoch on ImageNet with different CPU core clock rates, where the maximum clock rate is taken as a baseline for each CPU. For comparison: Upgrading from a GTX 680 to a GTX Titan is about +15% performance; from GTX Titan to GTX 980 another +20% performance; GPU overclocking yields about +5% performance for any GPU</figcaption></figure>
<p style="text-align: justify;">Note that these experiments were done on dated hardware; however, the results should still hold for modern CPUs and GPUs.</p>
<h2 style="text-align: justify;"><strong>Hard drive/SSD</strong></h2>
<p>The hard drive is not usually a bottleneck for deep learning. However, if you do stupid things it will hurt you: If you read your data from disk only when it is needed (blocking wait), then a 100 MB/s hard drive will cost you about 185 milliseconds for an ImageNet mini-batch of size 32 &#8212; ouch! However, if you asynchronously fetch the data before it is used (for example, torchvision data loaders), then you will have loaded the mini-batch in those 185 milliseconds while the compute time for most deep neural networks on ImageNet is about 200 milliseconds per mini-batch. Thus you will not face any performance penalty, since you load the next mini-batch while the current one is still being computed.</p>
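<p>The numbers behind this overlap argument are easy to sanity-check (a rough sketch using the article's figures; the 200 ms compute time is the per-mini-batch estimate from the text):</p>

```python
# Time to read one float32 ImageNet mini-batch (32 x 225 x 225 x 3)
# from a 100 MB/s hard drive, versus the ~200 ms compute per step.
batch_mb = 32 * 225 * 225 * 3 * 4 / 1e6  # ~19.4 MB per mini-batch
read_ms = batch_mb / 100 * 1e3           # at 100 MB/s: ~194 ms
compute_ms = 200

# With asynchronous prefetching, the read fully overlaps the compute,
# so a slow hard drive adds no wall-clock time per training step.
hidden = read_ms <= compute_ms
print(f"read: {read_ms:.0f} ms, fully hidden by compute: {hidden}")
```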
<p>However, I recommend an SSD for comfort and productivity: Programs start and respond more quickly, and pre-processing with large files is quite a bit faster. If you buy an NVMe SSD you will have an even smoother experience when compared to a regular SSD.</p>
<p>Thus the ideal setup is to have a large and slow hard drive for datasets and an SSD for productivity and comfort.</p>
<h2 style="text-align: justify;"><strong>Power supply unit (PSU)</strong></h2>
<p>Generally, you want a PSU that is sufficient to accommodate all your future GPUs. GPUs typically get more energy efficient over time; so while other components will need to be replaced, a PSU should last a long while, making a good PSU a good investment.</p>
<p>You can calculate the required watts by adding up the wattage of your CPU and GPUs with an additional 10% for other components and as a buffer for power spikes. For example, if you have 4 GPUs with 250 watts TDP each and a CPU with 150 watts TDP, then you will need a PSU with a minimum of 4&#215;250 + 150 + 100 = 1250 watts. I would usually add another 10% just to be sure everything works out, which in this case would result in a total of 1375 watts. I would round up and get a 1400 watt PSU.</p>
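<p>The rule of thumb above can be written down as a tiny helper (the percentages are this article's heuristic, not an industry standard):</p>

```python
def required_psu_watts(gpu_tdps, cpu_tdp, overhead=0.10, margin=0.10):
    """Sum GPU and CPU TDPs, add ~10% for other components and power
    spikes, then another ~10% safety margin before rounding up."""
    base = sum(gpu_tdps) + cpu_tdp
    return base * (1 + overhead) * (1 + margin)

# Example from the text: four 250 W GPUs and a 150 W CPU.
watts = required_psu_watts([250] * 4, 150)
print(f"Buy a PSU of at least ~{watts:.0f} W")  # ~1392 W -> round up to 1400 W
```

<p>Compounding the two 10% buffers gives ~1392 W here rather than the 1375 W of the flat additions above; either way you land on a 1400 watt PSU.</p>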
<p>One important part to be aware of is that even if a PSU has the required wattage, it might not have enough PCIe 8-pin or 6-pin connectors. Make sure you have enough connectors on the PSU to support all your GPUs!</p>
<p>Another important thing is to buy a PSU with high power efficiency rating – especially if you run many GPUs and will run them for a longer time.</p>
<p>Running a 4 GPU system at full power (1000-1500 watts) to train a convolutional net for two weeks will amount to 300-500 kWh, which in Germany &#8211; with rather high power costs of 20 cents per kWh &#8211; will amount to 60-100€ ($66-111). If this price assumes 100% efficiency, then training such a net with an 80% efficient power supply would increase the costs by an additional 18-26€ &#8211; ouch! This is much less for a single GPU, but the point still holds &#8211; spending a bit more money on an efficient power supply makes good sense.</p>
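<p>For reference, the electricity arithmetic looks like this (a sketch; &#8220;efficiency&#8221; here means PSU conversion efficiency, and the price is the German figure quoted above):</p>

```python
def training_cost_eur(avg_watts, hours, price_per_kwh=0.20, psu_efficiency=1.0):
    """Wall-plug energy cost of a training run; draw scales with 1/efficiency."""
    kwh = avg_watts / 1000 * hours / psu_efficiency
    return kwh * price_per_kwh

hours = 24 * 14  # two weeks around the clock
ideal = training_cost_eur(1250, hours)                      # 100% efficient PSU
real = training_cost_eur(1250, hours, psu_efficiency=0.80)  # 80% efficient PSU
print(f"ideal: {ideal:.0f} EUR, 80% PSU: {real:.0f} EUR, "
      f"extra: {real - ideal:.0f} EUR")
```

<p>At 1250 W for two weeks this works out to 420 kWh (84€) at 100% efficiency versus 525 kWh (105€) at 80%, an extra ~21€ per run.</p>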
<p>Using a couple of GPUs around the clock will significantly increase your carbon footprint and it will overshadow transportation (mainly airplane) and other factors that contribute to your footprint. If you want to be responsible, please consider going <a href="https://wp.nyu.edu/ml2/carbon-neutral-lab/">carbon neutral like the NYU Machine Learning for Language Group (ML2)</a> — it is easy to do, cheap, and should be standard for deep learning researchers.</p>
<h2>CPU and GPU Cooling</h2>
<p>Cooling is important and it can be a significant bottleneck which reduces performance more than poor hardware choices do. You should be fine with a standard heat sink or an all-in-one (AIO) water cooling solution for your CPU, but for your GPU you will need to make special considerations.</p>
<h3>Air Cooling GPUs</h3>
<p>Air cooling is safe and solid for a single GPU or if you have multiple GPUs with space between them (2 GPUs in a 3-4 GPU case). However, one of the biggest mistakes is made when you try to cool 3-4 GPUs, and you need to think carefully about your options in this case.</p>
<p>Modern GPUs will increase their speed – and thus power consumption – up to their maximum when they run an algorithm, but as soon as the GPU hits a temperature barrier – often 80 °C – the GPU will decrease the speed so that the temperature threshold is not breached. This enables the best performance while keeping your GPU safe from overheating.</p>
<p>However, typical pre-programmed fan speed schedules are badly designed for deep learning programs, so this temperature threshold is reached within seconds after starting a deep learning program. The result is decreased performance (0-10%), which can be significant for multiple GPUs (10-25%) where the GPUs heat each other up.</p>
<p>Since NVIDIA GPUs are first and foremost gaming GPUs, they are optimized for Windows. You can change the fan schedule with a few clicks in Windows, but not so in Linux, and as most deep learning libraries are written for Linux this is a problem.</p>
<p>The only option under Linux is to set a configuration for your Xorg server (Ubuntu) where you set the option &#8220;coolbits&#8221;. This works very well for a single GPU, but if you have multiple GPUs where some of them are headless, i.e. they have no monitor attached to them, you have to emulate a monitor, which is hard and hacky. I tried it for a long time and had frustrating hours with a live boot CD to recover my graphics settings &#8211; I could never get it running properly on headless GPUs.</p>
<p>The most important point of consideration if you run 3-4 GPUs on air cooling is to pay attention to the fan design. The &#8220;blower&#8221; fan design pushes the air out of the back of the case so that fresh, cooler air is pushed into the GPU. Non-blower fans suck in air from the vicinity of the GPU to cool it. However, if you have multiple GPUs next to each other then there is no cool air around, and GPUs with non-blower fans will heat up more and more until they throttle themselves down to reach cooler temperatures. Avoid non-blower fans in 3-4 GPU setups at all costs.</p>
<h3>Water Cooling GPUs For Multiple GPUs</h3>
<p>Another, more costly, and craftier option is to use water cooling. I do not recommend water cooling if you have a single GPU or if you have space between your two GPUs (2 GPUs on a 3-4 GPU board). However, water cooling makes sure that even the beefiest GPUs stay cool in a 4 GPU setup, which is not possible when you cool with air. Another advantage of water cooling is that it operates much more silently, which is a big plus if you run multiple GPUs in an area where other people work. Water cooling will cost you about $100 for each GPU and some additional upfront costs (something like $50). Water cooling will also require some additional effort to assemble your computer, but there are many detailed guides on that and it should only require a few more hours of time in total. Maintenance should not be that complicated or effortful.</p>
<h3>A Big Case for Cooling?</h3>
<p>I bought large towers for my deep learning cluster, because they have additional fans for the GPU area, but I found this to be largely irrelevant: About 2-5 °C decrease, not worth the investment and the bulkiness of the cases. The most important part is really the cooling solution directly on your GPU &#8212; do not select an expensive case for its GPU cooling capability. Go cheap here. The case should fit your GPUs, but that&#8217;s it!</p>
<h3>Conclusion Cooling</h3>
<p>So in the end it is simple: For 1 GPU, air cooling is best. For multiple GPUs, you should get blower-style air cooling and accept a tiny performance penalty (10-15%), or you pay extra for water cooling, which is more difficult to set up correctly but has no performance penalty. Air and water cooling are both reasonable choices in certain situations. I would, however, recommend air cooling for simplicity in general &#8212; get a blower-style GPU if you run multiple GPUs. If you want to use water cooling, try to find all-in-one (AIO) water cooling solutions for GPUs.</p>
<h2>Motherboard</h2>
<p>Your motherboard should have enough PCIe ports to support the number of GPUs you want to run (usually limited to four GPUs, even if you have more PCIe slots); remember that most GPUs have a width of two PCIe slots, so buy a motherboard that has enough space between PCIe slots if you intend to use multiple GPUs. Make sure your motherboard not only has the PCIe slots, but actually supports the GPU setup that you want to run. You can usually find information on this if you search for your motherboard of choice on Newegg and look at the PCIe section of the specification page.</p>
<h2>Computer Case</h2>
<p>When you select a case, you should make sure that it supports full-length GPUs that sit on top of your motherboard. Most cases support full-length GPUs, but you should be suspicious if you buy a small case. Check its dimensions and specifications; you can also try a Google image search of that model and see if you find pictures with GPUs in them.</p>
<p>If you use custom water cooling, make sure your case has enough space for the radiators. This is especially true if you use water cooling for your GPUs. The radiator of each GPU will need some space &#8212; make sure your setup actually fits into the case.</p>
<h2 style="text-align: justify;"><strong>Monitors</strong></h2>
<p style="text-align: justify;">I first thought it would be silly to write about monitors also, but they make such a huge difference and are so important that I just have to write about them.</p>
<p style="text-align: justify;">The money I spent on my three 27-inch monitors is probably the best money I have ever spent. Productivity goes up by a lot when using multiple monitors. I feel desperately crippled if I have to work with a single monitor. Do not short-change yourself on this matter. What good is a fast deep learning system if you are not able to operate it in an efficient manner?</p>
<figure id="attachment_123" aria-describedby="caption-attachment-123" style="width: 700px" class="wp-caption aligncenter"><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/2015-03-04-13-58-10.jpg"><img data-attachment-id="123" data-permalink="https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/2015-03-04-13-58-10/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/2015-03-04-13-58-10.jpg?fit=3264%2C1836&amp;ssl=1" data-orig-size="3264,1836" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;2.4&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;U9200&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1425477490&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;4.13&quot;,&quot;iso&quot;:&quot;104&quot;,&quot;shutter_speed&quot;:&quot;0.016666&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;1&quot;}" data-image-title="2015-03-04 13.58.10" data-image-description="&lt;p&gt;Typical layout when I do deep learning: Left: Papers, google searcheres, gmail, stackoverflow threads; middle: Code; right: Output windows, R, folders, systems monitors, GPU monitors, to-do list, and other small applications.&lt;/p&gt;
" data-image-caption="" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/2015-03-04-13-58-10.jpg?fit=300%2C169&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/2015-03-04-13-58-10.jpg?fit=1024%2C576&amp;ssl=1" class="wp-image-123 size-full" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/2015-03-04-13-58-10.jpg?resize=700%2C394" alt="2015-03-04 13.58.10" width="700" height="394" data-recalc-dims="1" /></a><figcaption id="caption-attachment-123" class="wp-caption-text"><strong>Typical monitor layout when I do deep learning:</strong> Left: Papers, Google searches, gmail, stackoverflow; middle: Code; right: Output windows, R, folders, systems monitors, GPU monitors, to-do list, and other small applications.</figcaption></figure>
<h2 style="text-align: justify;"><strong>Some words on building a PC</strong></h2>
<p style="text-align: justify;">Many people are scared to build computers. The hardware components are expensive and you do not want to do something wrong. But it is really simple, as components that do not belong together do not fit together. The motherboard manual is often very specific about how to assemble everything, and there are tons of guides and step-by-step videos which walk you through the process if you have no experience.</p>
<p style="text-align: justify;">The great thing about building a computer is that once you have done it, you know everything there is to know about building a computer, because all computers are built in the very same way &#8211; so building a computer will become a life skill that you will be able to apply again and again. So no reason to hold back!</p>
<h2><strong>Conclusion / TL;DR</strong></h2>
<p><strong>GPU</strong>: RTX 2070 or RTX 2080 Ti. GTX 1070, GTX 1080, GTX 1070 Ti, and GTX 1080 Ti from eBay are good too!<br /><strong>CPU</strong>: 1-2 cores per GPU, depending on how you preprocess data. &gt; 2GHz; the CPU should support the number of GPUs that you want to run. PCIe lanes do not matter.</p>
<p><strong>RAM</strong>:<br />&#8211; Clock rates do not matter — buy the cheapest RAM.<br />&#8211; Buy at least as much CPU RAM to match the RAM of your largest GPU.<br />&#8211; Buy more RAM only when needed.<br />&#8211; More RAM can be useful if you frequently work with large datasets.</p>
<p><strong>Hard drive/SSD</strong>:<br />&#8211; Hard drive for data (&gt;= 3TB)<br />&#8211; Use SSD for comfort and preprocessing small datasets.</p>
<p><strong>PSU</strong>:<br />&#8211; Add up the watts of GPUs + CPU. Then multiply the total by 110% for the required wattage.<br />&#8211; Get a high efficiency rating if you use multiple GPUs.<br />&#8211; Make sure the PSU has enough PCIe connectors (6-pin and 8-pin).</p>
<p><strong>Cooling</strong>:<br />&#8211; CPU: get standard CPU cooler or all-in-one (AIO) water cooling solution<br />&#8211; GPU:<br />&#8211; Use air cooling<br />&#8211; Get GPUs with &#8220;blower-style&#8221; fans if you buy multiple GPUs<br />&#8211; Set coolbits flag in your Xorg config to control fan speeds</p>
<p><strong>Motherboard</strong>:<br />&#8211; Get as many PCIe slots as you need for your (future) GPUs (one GPU takes two slots; max 4 GPUs per system)</p>
<p><strong>Monitors</strong>:<br />&#8211; An additional monitor might make you more productive than an additional GPU.</p>
<p>Update 2018-12-14: Reworked entire blog post with up-to-date recommendations.<br />Update 2015-04-22: Removed recommendation for GTX 580</p><p>The post <a rel="nofollow" href="https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/">A Full Hardware Guide to Deep Learning</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/feed/</wfw:commentRss>
			<slash:comments>945</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">121</post-id>	</item>
		<item>
		<title>Machine Learning PhD Applications — Everything You Need to Know</title>
		<link>https://timdettmers.com/2018/11/26/phd-applications/</link>
					<comments>https://timdettmers.com/2018/11/26/phd-applications/#comments</comments>
		
		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Mon, 26 Nov 2018 19:13:57 +0000</pubDate>
				<category><![CDATA[Academia]]></category>
		<category><![CDATA[PhD Life]]></category>
		<category><![CDATA[Advisors]]></category>
		<category><![CDATA[Grad school]]></category>
		<category><![CDATA[PhD]]></category>
		<guid isPermaLink="false">http://timdettmers.com/?p=710</guid>

					<description><![CDATA[<p>I studied in depth how to be successful in my PhD applications and it paid off: I got admitted to Stanford, University of Washington, UCL, CMU, and NYU. This blog post is a mish-mash of how to proceed in your PhD applications from A to Z. It discusses what is important and what is not. [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2018/11/26/phd-applications/">Machine Learning PhD Applications — Everything You Need to Know</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>I studied in depth how to be successful in my PhD applications and it paid off: I got admitted to Stanford, University of Washington, UCL, CMU, and NYU. This blog post is a mish-mash of how to proceed in your PhD applications from A to Z. It discusses what is important and what is not. It discusses application materials like the statement of purpose (SoP) and how to make sense of these application materials.</p>
<p><span id="more-710"></span></p>
<p>There are some excellent sources out there on this topic, and it is worth stopping for a second to understand what this blog post will give you and what other sources can give you. This blog post is mainly focused on PhD applications for deep learning and related fields like natural language processing, computer vision, reinforcement learning, and other sub-fields of deep learning. This blog post assumes that you already have a relatively strong profile, meaning you probably already have one or more publications under your belt and have worked with more than one person on research. This blog post is designed to help you optimize your chance of success for top programs.</p>
<p>If you seek more general information for PhD admissions, I recommend reading all the most highly voted questions and answers from <a href="https://academia.stackexchange.com/questions?sort=votes">Academia StackExchange</a>. Other important sources are&nbsp;<a href="http://www.cs.cmu.edu/~harchol/gradschooltalk.pdf">Applying to Ph.D. Programs in Computer Science</a>&nbsp;which is a detailed write-up of the full admission process as viewed by CMU faculty. A similar but more concise source — in particular, relevant for good but not strong candidates — is the blog post <a href="https://da-data.blogspot.com/2015/03/reflecting-on-cs-graduate-admissions.html">Reflecting on CS Graduate Admissions</a>&nbsp;which is again by CMU faculty. Less useful, but a quick read is the negative view of <a href="http://www.cs.cmu.edu/~pavlo/blog/2015/10/how-to-write-a-bad-statement-for-a-computer-science-phd-admissions-application.html">How to Write a Bad Statement for a Computer Science Ph.D. Admissions Application</a>.</p>
<p>This blog post will first define what is important in PhD applications. Then we will dive into the application materials and how to think about them. Then I will talk a bit about the application process. The final section of the main part of this blog post will be on selecting schools &#8212; which schools are too good or too bad for me? After that, I will close with a Q&amp;A section which was drawn from <a href="https://twitter.com/Tim_Dettmers/status/1064258559918002176">questions on Twitter</a>. I will update this Q&amp;A section periodically. If you have some questions regarding the application process, please leave a comment and I will try to get back to you.</p>
<p><figure style="width: 600px" class="wp-caption aligncenter"><a href="http://phdcomics.com/comics/archive.php?comicid=368"><img src="https://i0.wp.com/phdcomics.com/comics/archive/phd091703s.gif?resize=600%2C260" alt="" width="600" height="260"  data-recalc-dims="1"></a><figcaption class="wp-caption-text">Source: <a href="http://phdcomics.com/comics/archive.php?comicid=368"> PhD Comics</a></figcaption></figure></p>
<h1>Understanding What Makes a Strong PhD Application</h1>
<p>The most important factor that determines admission at any research university is research potential: How likely are you to become a great researcher? The main direct indicators for this are in order of importance:</p>
<ol>
<li>Recommendations: Respected professors speak highly of you. Personal connections are important.</li>
<li>Research experience: You did successful research before. Measured in publications, first-authorship, and prestige of conference where you published.</li>
</ol>
<p>Other indirect factors can sometimes help if they are exceptional, but usually only the first two factors, recommendations and research experience, count. In order of importance:</p>
<ol>
<li>Undergraduate university name: Some universities select aggressively for this, some others do not care so much.</li>
<li>Employer name: It is common for students to be admitted who were previously employed in finance or at companies such as Google, Facebook, etcetera.</li>
<li>Smarts: A perfect GPA and perfect GRE are somewhat correlated with intelligence (or at least with how fast you can learn and understand).</li>
<li>Grit / Conscientiousness: You do well under continuous rejection, disappointment, and failure. If you faced and have overcome difficulties you might want to include your story in the statement of purpose.</li>
<li>Accomplishment: You won Math or CS competitions.</li>
<li>Recognition: You won prestigious scholarships/fellowships.</li>
<li>Good at math or engineering: You developed or contributed to open source projects. You worked with research code.</li>
<li>Heritage: Parents are professors.</li>
</ol>
<h1>Understanding Application Materials</h1>
<h2>Understanding Recommendation Letters</h2>
<p>For recommendation letters, one could devise four categories: Strong, Good, Weak, and Bad. Note that the main thing that admission committees look for in recommendation letters are indicators of research potential. This section has the main purpose of making you aware of what constitutes a good or strong letter and based on this information it might be easier for you to select letter writers.</p>
<h3>Signs of a Bad Recommendation Letter</h3>
<ul>
<li>Your letter writer knows you and writes bad things about you. Especially in the US anything even slightly critical is very bad.</li>
<li>Your letter writer does not know you (you had a class with her but you left no impression).</li>
<li>Your letter is short and only states that you did well in class.</li>
</ul>
<h3>Signs of a Weak Recommendation Letter</h3>
<ul>
<li>Your letter writer knows you from class only.</li>
<li>Your letter writer is favorable, but can only write about achievements in class: Great project work in class; part of lively, interesting discussions in class.</li>
<li>The letter writer does not comment on your research.</li>
<li>The letter writer is not known by the admission committee nor by potential advisors.</li>
</ul>
<h3>Signs of a Good Recommendation Letter</h3>
<ul>
<li>The name of the letter writer is known by parts of the admission committee.</li>
<li>The letter writer&#8217;s name and work are known by at least one potential advisor mentioned in the statement of purpose.</li>
<li>The letter writer worked with you on research.</li>
<li>The letter writer mentions your excellent research abilities in anecdotes&nbsp;that demonstrate your creativity, commitment, persistence and research skills in general.</li>
<li>The letter writer writes about how you published your research.</li>
<li>The letter writer comments about research done outside of her lab.</li>
</ul>
<h3>Signs of a Strong Recommendation Letter</h3>
<ul>
<li>US-style recommendation letter: The achievements are oozing through the paper. Everything is very much overdone, that is, simple things become grand achievements.</li>
<li>The letter writer has an excellent command of English.</li>
<li>The letter writer is personally known by at least one potential advisor mentioned in the statement of purpose.</li>
<li>The letter writer is known for making excellent recommendations (previously recommended students do very well).</li>
<li>The letter writer mentions your excellent research abilities in anecdotes&nbsp;that demonstrate your creativity, commitment, persistence and research skills in general.</li>
<li>The letter writer mentions your abilities which help indirectly with research (engineering skills, presentation skills, interpersonal skills) and wraps these skills into anecdotes.</li>
<li>The letter writer comments about research done outside of her lab.</li>
</ul>
<p>Note a few things:</p>
<ul>
<li>Anecdotes are important because they show that the letter writer really knows you. They also read much better. Stories are more interesting than checklists.</li>
<li>The letter does not need to contain everything listed here to be considered &#8220;bad&#8221; or &#8220;strong&#8221; and so forth. Recommendation letters are complicated.</li>
<li>If you select recommendation letters it can make sense to have some diversity among letters that highlight different strengths. One strong letter on research skills, a good letter on engineering skills (internship), and a good letter on performance in class/project work is a great combination. This combination is better than a strong letter on research, a good letter on research, and a weak letter on research.</li>
<li>Please see more details about the process of asking about recommendation letters below.</li>
</ul>
<h2>Understanding Publications</h2>
<h3>Author Position</h3>
<p>Publications are direct evidence for research experience and research skill. If you published as a first author, people know that you did most of the work. If you published as a second author, people know that you did a good portion of the work (25%-50%). If your name is the third or later, your contribution is discounted, but you probably went through the entire research process towards publication and gained a good amount of research experience. If you published a couple of first author papers a third author paper looks very good: It shows that you can work in a team.</p>
<h3>Prestige of Venue</h3>
<p>If you published your work at a respectable conference, people know that: (1) your work is high quality; (2) your work can be trusted; (3) your current research skills are sufficient to publish at great conferences; (4) you are competitive and/or you can stay productive under the pressure of publishing at a top conference.</p>
<p>It helps to view this through the eyes of a potential advisor: Suppose you have two students, one who already published at NeurIPS (Tier A) and one who published at a Tier B conference. You would know that the first student is probably ready to work on a research project which is aiming for NeurIPS next year. The second student would need further preparation, for example, publishing in a workshop or at a less competitive Tier A conference before making the step towards NeurIPS. With the second student, there is some risk that they might take more than a year to acquire the research skills needed to publish at Tier A conferences. Pushing a student towards NeurIPS can be stressful for an advisor and it is easier to work with someone who already has the necessary research skills. If there is less stress between advisor and student then it is easier to develop a strong professional relationship, which makes it easier and more fun to work with each other. So a potential advisor has good reasons to select according to the prestige of the conference where you published.</p>
<h3>Creativity, Citations, etcetera</h3>
<p>Other indicators have little effect on the application. Your work might be unusually creative, but you have no track record that you are a creative researcher. Maybe you got lucky.</p>
<p>The importance of publications often only emerges over the years. Often you published shortly before the PhD applications, which means that the citations on your work are a poor indicator of impact. If you get an unusually high number of citations in a short time this can help, but maybe you just got lucky or are good at marketing. Usually, the number of citations over the past 1-3 years is not a reliable indicator of research potential and as such is disregarded. If you have a citation history over the past 5 years this might be a different story, but this does not apply to most applicants.</p>
<h2>Understanding the Statement of Purpose</h2>
<p>For most institutions, the statement of purpose is mainly a filter for people who took no time to polish the SoP. Your writing can show how you think, how you can sell, how you explain things, but it can also show that you are lazy and do not pay attention to details. It can show that you are not able to Google simple recipes of how to write (and how not to write) a simple formal document. For some institutions, the SoP can be important (CMU) but the content does not really differ for these institutions.</p>
<p>Beyond formalities, the SoP is also the only document where you can explain why you underperformed in certain circumstances. For example, you can explain any extraordinary difficulties that you had along the way to graduate school, or why you did not do so well in certain semesters/quarters at university. The structure of a SoP should be the following:</p>
<ol>
<li>Intro to research interests with a catchy hook that makes the reader want to read more (one paragraph). This is the most important bit: If you do not interest your readers in this paragraph it is unlikely that they will focus on the rest of the letter.</li>
<li>The research experiences that you gathered along your way to grad school (about one page).</li>
<li>Identifying what research you want to do in the future.</li>
<li>Identify people with whom you want to work with and why.</li>
<li>(Optional) Explaining extenuating circumstances where appropriate.</li>
</ol>
<p>In some circumstances, the SoP can be very important. This is the case if you showed good — but not strong or weak — academic potential and you had to overcome significant hardship to be able to do research. If your application is strong and you write about hardships it might alienate your readers (privileged prick); if your application is weak it might also alienate your readers (whining loser). If your application is good it is exactly right (a&nbsp;smart person who pushed through difficulties). For example, I had a rare situation where I was barred from university access and my SoP was very important to explain the difficulties that I faced under these circumstances.</p>
<p>However, disclosing hardships and weaknesses — like learning disabilities and mental illnesses — can also be a double-edged sword: You might either alienate the readers or you might draw their sympathy and admiration for persisting in a difficult situation. If you disclose such facts, it needs to be done right and the SoP needs to be extremely polished. Do not attempt this without feedback from expert writers. For some stories which are more socially acceptable you do not need expert feedback to do it right: It is easy to write a compelling story where you worked yourself from extreme poverty into college and now want to realize your potential by doing a PhD; it is difficult to write a compelling story about the hardships that you faced while suffering from schizophrenia or bipolar disorder.</p>
<p>However, if you did not face any hardship, do not make up stories that make no sense: &#8220;As a white, male, upper-class US citizen, I was haunted by the responsibility of my&nbsp;privilege from an early age and my academic performance suffered in the process.&#8221; Instead, concentrate on your research experience.</p>
<h2>Understanding GRE, TOEFL, GPA</h2>
<p>The GRE &amp; TOEFL tests and GPA are usually used as filter criteria. A very high GPA can be a good indicator of &#8220;some intelligence&#8221; which can help if your recommendation letters and publications are borderline. But a GPA of 4.0 will not help if you have no publications and bad recommendation letters — it might even hurt you because it shows that you concentrate on useless classes rather than research. GRE and TOEFL scores are pure filters: If you have an okay score you are not filtered out. If you have a&nbsp;perfect GRE score, it can help a little bit, but much less so than a perfect GPA. Beyond that, great GRE scores do not matter: I got into three out of the top five US computer science programs with verbal 159 (81%), quantitative 163 (86%), writing 5.0 (93%), a TOEFL of 120/120, and a GPA of 8.1/10. Any GPA higher than 3.5 is good; above that, the exact number does not matter much, though a GPA of 4.0 might help a little bit.</p>
<h2>Understanding the CV</h2>
<p>The CV lists what you have done. There are no surprises here. The content is important but the content is also determined by what you have done before and cannot be changed. Do not &#8220;tune&#8221; your CV by phrasing things in a nice way or by making your CV look &#8220;nice&#8221; or &#8220;creative&#8221; — this is a waste of time. Just list what you have done.</p>
<h2>The Application Process</h2>
<h3>How to ask your professor for a recommendation letter</h3>
<p>You write two emails: (1) Just ask if the person can write you a good or strong recommendation letter. Knowledgeable recommendation letter writers will reject your request if they think they cannot write you a good letter. In this case, look for someone else. (2) If your recommender agrees she will ask you to include some information for the letter. Give a list of what you have done with the person. Write it in a style that can be easily wrapped into anecdotes:</p>
<ul>
<li>DO: &#8220;You told me in a meeting that with some extra work we could make it for the NeurIPS deadline. In the next two weeks, I developed an improved deep network architecture and started writing up the findings. The next week, Jane extended my code for an additional task. We then had enough results to submit our work to NeurIPS.&#8221;</li>
<li>DO NOT: &#8220;Jane and I published our research at NeurIPS.&#8221;</li>
</ul>
<p>Anecdotes can also come from&nbsp;interactions with PhD students and post-docs:</p>
<ul>
<li>&#8220;I worked with Tom on developing the research library that served as the main framework for our research that we published at NeurIPS. I worked one week on the library and Tom told me that the library was well designed and well performing.&#8221;</li>
</ul>
<p>Your advisor will then ask the respective PhD student or post-doc for more information to write something like this:</p>
<ul>
<li>&#8220;My PhD student Tom — whom I regard as one of my most engineering-savvy students — worked with Jane on a research project where we needed to develop a code-base for language modeling before we could start the research. Tom gave this task to Jane and estimated it to take 3 weeks. Jane completed it within one week. Tom told me that after he inspected Jane&#8217;s code in a code-review, he found that Jane&#8217;s engineering abilities are on-par or even exceed his own — the code was very high quality and lightning fast. Jane&#8217;s engineering skills helped with the rapid development of research ideas. The research project became a walk in the park because of this. Jane published her work at NeurIPS2020&#8230;&#8221;</li>
</ul>
<p>If you have three letters which are on or above the &#8220;Good&#8221; level, you should think about making your letters more diverse. I, for example, used one academic letter, one industry lab letter, and one letter from a lecturer who is aware of my research.</p>
<h3>Statement of Purpose</h3>
<p>Start early and ask experienced people for feedback. You should be safe if you follow the formula above. If you want to disclose difficulties that you had along the way to graduate school you will need a lot of time for your SoP, and you can expect that the SoP will take by far the most time of all your application materials.</p>
<p>Try to reuse the SoP between universities. It takes too much time to &#8220;personalize&#8221; the SoP for each university. The only section that I changed in my SoP from university to university was the section that mentions the potential advisors I would like to work with.</p>
<h3>Online Application</h3>
<p>Start filling out the online applications early. Some forms are terrible and take some time to fill out, and it is great if you can get this out of the way as early as possible to focus on recommendation letters, university selection, and your statement of purpose. You should have a good reserve of money for these applications. The entire process might cost up to $1000. If you do not have the money, ask some relatives for help early on.</p>
<h2>How to Select Schools for PhD Applications?</h2>
<h3>Can I get admitted to a top school?</h3>
<p>Many people reading this probably have the dream to get into a top school like Stanford, MIT, Berkeley, or CMU. But admission is really tough. Some programs are highly selective. Here are admission statistics for one top school I was admitted to and the prior probability of getting admitted to the program. Note that I have hard statistics on the schools and publications, but I do not have hard statistics on the letters and personal connections; there I make assumptions based on what I have heard and seen from admitted students I talked to:</p>
<ul>
<li>Top 2 undergraduate school AND 1 to 3 publications AND &gt;=1 strong letter AND personal connections: 38%</li>
<li>Top 4 undergraduate school AND 1 to 3 publications AND &gt;=1 strong letter AND personal connections: 14%</li>
<li>Top 20 undergraduate school AND 2 to 4 publications AND &gt;=1 strong letter AND personal connections: 21%</li>
<li>Below top 20 undergraduate school AND best school in a country (Tokyo, Australian National) AND 2 to 4 publications AND &gt;=1 strong letter AND personal connections: 11%</li>
<li>Master in top 3 school AND 1 to 4 publications AND &gt;=1 strong letter AND personal connections: 5%</li>
<li>Below top 20 undergraduate school AND not the best school in a country AND &gt;4 publications and &gt;=2 strong letters AND personal connections: 5%</li>
<li>Below top 20 undergraduate school AND not the best school in a country AND &gt;3 publications and &gt;=2 strong letters AND award for Best Teacher/Young Scientist AND personal connections: 5%</li>
</ul>
<p>This program, like most top programs, selects aggressively for undergrad degree. Note that usually, some form of personal connection (a letter writer knows a possible advisor at the school) is a requirement especially for edge cases. Other top programs select differently. For example, while CMU also selects aggressively for undergrad degree, they also like candidates with an unusual background which reflects strong performance under difficult circumstances. Some schools really like awards in math/CS competitions. Many schools like it if you got some form of best teacher award. Some schools like it if you have a portfolio of hacks (MIT). However, in general, in order of importance to get admitted to top schools:</p>
<ol>
<li>Personal connections</li>
<li>Top undergrad school AND publications</li>
<li>Strong letters AND publications</li>
<li>Publications</li>
<li>Anything else</li>
</ol>
<p>This means that if you are doing an undergrad at a top 2 school and have no publications you will still have a hard time. A top 2 school and a publication increase your chances of admittance dramatically. If you have no personal connections it is difficult to get admitted even with a strong profile. However, if your profile is exceptionally strong and built under respected advisors, then personal connections do not matter.</p>
<p>There are some other factors for special cases. For example, if you study at a top school and have only 1 publication then GPA will be an important factor. However, in general, top schools do not care about GPA numbers from schools below the top 20 as long as the GPA is at least 3.5 or equivalent. So if you have a GPA of 3.5 at a below-top-20 school and you have 4 publications you have a good chance of getting admitted. A low GPA (which is still &gt;3.5) can even be a factor in your favor if your research profile is very strong, as it demonstrates that you do not care about classes but are passionate about research — exactly what advisors want to see.</p>
<p>Another thing to note here is publication inflation: the value of a single publication decreases as more and more students fulfill this requirement. The more students are interested in ML PhDs, the more stringent the publication requirements become. It might once have been fine to have no publications and still get into an ML PhD, but this is often no longer the case.</p>
<h3>How to get admitted to top schools?</h3>
<p>The statistics above do not mean that you cannot get accepted by these schools, but they do mean that if your profile is too weak you should take another year to bolster it. I, for example, extended my master by a year to squeeze in a year of research internships. Without this, I would never have made it into these schools. If your dream is to get into one of these top schools this is by far the best option. Even if you do not necessarily want to get into top schools, a research internship is highly recommended.</p>
<p>A research internship will give you:</p>
<ul>
<li>Improved research skills so you can get an easier start into a PhD.</li>
<li>A test whether a PhD or a certain research direction (NLP vs computer vision vs systems) is right for you.</li>
<li>A good or even strong recommendation letter (the longer the internship the better).</li>
<li>A possible publication.</li>
</ul>
<p>But even finding a research internship is easier said than done! How can you approach this? My next blog post will deal in detail with the topic of how you can improve your application file for the application cycle in the next year.</p>
<h3>Realistic School Selection</h3>
<p>You should apply to about 10-15 universities. If you apply to more, you run the risk of not having enough time to really polish your applications. If you apply to fewer, you run the risk of not being accepted anywhere.</p>
<p>You should have one or two backup universities where it is likely that you will be accepted (&gt;75%). Often the university where you already studied is a good candidate for this since your recommendation letter writers will be known to the university faculty.&nbsp;Apply to all top universities where you have some hope of getting admitted (&gt;10% chance).&nbsp;Fill out the rest of the university slots with universities where you expect a good admission rate (25-33%) — you should have a minimum of 3 universities of this kind. These universities are usually the ones where a recommendation letter writer has a personal connection to a faculty member with whom you would like to work.</p>
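<p>The portfolio rule of thumb above can be sanity-checked with a quick back-of-the-envelope calculation. If we treat each admission decision as independent (a simplification: committees see the same letters and publications, so real outcomes are correlated), the chance of at least one acceptance is one minus the product of the per-school rejection probabilities. The numbers below are illustrative assumptions, not real admission rates:</p>

```python
from math import prod

# Hypothetical portfolio following the rule of thumb above:
# two backups (~75% each), five reach schools (~10% each),
# three mid-range schools (~30% each). These probabilities are
# assumptions for the sake of the calculation, not real rates.
schools = [0.75, 0.75] + [0.10] * 5 + [0.30] * 3

# Under the independence assumption,
# P(at least one admit) = 1 - P(every school rejects).
p_none = prod(1 - p for p in schools)
p_at_least_one = 1 - p_none

print(f"P(at least one admit) = {p_at_least_one:.3f}")
```

<p>Under these assumed rates the portfolio yields at least one admit with probability around 0.99. Because real decisions are correlated, the true number is lower, which is exactly why the high-probability backup schools matter so much.</p>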
<p>Note that the best advisors are not necessarily at the top schools. You can get excellent PhD training at many schools outside of the top 20. However, if you are thinking about an academic career then the school rank will be really important and you should try to find an advisor at a top school.</p>
<p>Pick universities mainly according to possible advisors. Make sure each university has more than one advisor you would like to work with. Do not apply to a university where there is a single good advisor. If your list is too small, broaden your area of interest. For example, if you would like to do deep learning and NLP and you cannot find enough fitting advisors consider also some advisors in computer vision or other fields.</p>
<h1>General Q&amp;A</h1>
<h3>4 year UK PhD vs 6 year US PhD</h3>
<p>In the first 1-2 years of a US PhD you will take quite a few classes since the US PhD is designed for bachelor students. By contrast, the UK PhD is designed for students who already have a (1 year) master degree and has few classes. Thus you can get started immediately with research in a UK PhD, which can be a nice advantage.</p>
<p>US PhD:</p>
<ul>
<li>Designed for bachelor students</li>
<li>Classes for 1-2 years. Classes distract from research.</li>
<li>Funding guaranteed with admission, that is, you have guaranteed positions as a research assistant or a teaching assistant.</li>
</ul>
<p>UK PhD</p>
<ul>
<li>Designed for master students</li>
<li>Classes for 0.25 &#8211; 0.5 years. You can focus on your research from start to finish.</li>
<li>Funding can be problematic and is often dependent on your advisor. This is why it is important to get in touch with your potential advisor before you apply.</li>
<li>Less prestigious (in most cases) and thus it will be more difficult to get academic positions after your PhD. It will be more difficult to get oral presentations, best paper awards etc due to visibility bias.</li>
</ul>
<p>Also be aware of local effects. If you study in the US you will also be in a US research bubble. The same is true if you study in Europe or Asia. For example, researchers in Europe know the &#8220;famous&#8221; researchers worldwide, but beyond that, they know more European universities than US universities in general (e.g. Stony Brook vs the University of Sheffield). The same is true for other locations. If you want to join academia in Europe and cannot get admitted to top US schools, it might make sense to apply mostly to EU universities.</p>
<h3>Is a master&nbsp;required for a PhD?</h3>
<p>In continental Europe, bachelor degrees are usually 3 years long and you require a master degree to start a PhD. In the US and UK, bachelors are often 4 years long and you can start a PhD right after your bachelor.</p>
<h3>Does work experience matter?</h3>
<p>It can help especially if you work at a prestigious institution (Google, Facebook, McKinsey, Goldman Sachs etc.). Other work experience can help if it is software engineering related, but any research experience (research internship) will be seen as far superior. Just a good job and no research experience will not help you.</p>
<h3>How to pick advisors?</h3>
<ul>
<li>Look at recent publications to get a sense of overlapping interest. Avoid working with academics that did not publish papers recently. There does not need to be an overlap in current research, but you should be interested in the research that the advisor is doing.</li>
<li>Look at the list of students that graduated and where they are now. If you cannot find a list of students that graduated this is a red flag (or a new faculty). This is a good indicator of the quality of advice and training that you will get.</li>
<li>Does the advisor have a startup? How many students does the advisor have? The combination of these factors is a good indicator of how much time you can expect the advisor to have. Depending on how experienced you are in research, you will need an advisor with more or less time.</li>
<li>Is there a fallback option in the same department? Sometimes relationships do not work out. Protect yourself by having a second advisor option as a fallback.</li>
</ul>
<h3>Should one even do a PhD?</h3>
<p>If you want to work in academia you will need a PhD.</p>
<p>In industry, everything is regulated by supply and demand. The supply of AI researchers will rise sharply in the next years. If the AI hype collapses, the demand will recede. The situation might be very similar to the one data scientists faced in 2018: Companies only take over-qualified applicants because there is much more supply than demand. In this situation, a PhD will make a big difference if you want to switch jobs or want to be promoted. You might get hired without a PhD now, but you might have problems later if you want to switch to another research lab (because the supply of skilled PhDs might be high, while demand is low).</p>
<p>If the AI hype does not collapse (unlikely) then you can find and switch jobs easily without a PhD. However, note that promotion might still be more difficult and you might need to do more &#8220;research engineering work&#8221; compared to research. If you are happy with a research engineer position a PhD might be useless for you.</p>
<p>Do not do a PhD for the reasons above alone. If you do not want to do research do not do a PhD.</p>
<h3>Contact advisor before application?</h3>
<p>This can make sense if a recommendation letter writer can introduce you to a potential advisor. However, this is not required in the US. It can also backfire since it removes a shroud of mystery around you: sometimes it is more impressive to see your publications and recommendation letters first rather than to talk to you in person and see the recommendation letters afterward. In the EU it is sometimes required to contact a potential advisor before an application. If you need to do so, try to get introduced via someone who knows the advisor personally, for example, your bachelor or master thesis advisor. If you do not have a personal connection to your potential advisor you might want to write an email with:</p>
<ul>
<li>Your current advisor</li>
<li>A sentence about your past work (optionally: where did you publish your work?)</li>
<li>4 bullet points about potential work that you could do with the advisor in the form of &#8220;idea: One sentence that explains the idea&#8221;</li>
</ul>
<p>It is very unlikely that your potential advisor will read and reply to your email if you do not have a personal contact. If you do not have a personal contact and you apply to EU (UK) universities, then you might want to apply somewhere else.</p>
<h3>How to pick a topic for your research proposal?</h3>
<p>The topic for the research proposal does not matter. Nobody will ask you to do the work that you described in your research proposal. You can pick your research proposal topic based on how easy it would be to reuse it across different applications. If you do not need to rewrite it for different applications you save a lot of time. One thing to consider: The more familiar you are with a topic the easier it is to write a good proposal.</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2018/11/26/phd-applications/">Machine Learning PhD Applications — Everything You Need to Know</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://timdettmers.com/2018/11/26/phd-applications/feed/</wfw:commentRss>
			<slash:comments>154</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">710</post-id>	</item>
	</channel>
</rss>
