<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>

<channel>
	<title>Uncategorized Archives &mdash; Tim Dettmers</title>
	<atom:link href="https://timdettmers.com/category/uncategorized/feed/" rel="self" type="application/rss+xml" />
	<link>https://timdettmers.com/category/uncategorized/</link>
	<description>Making deep learning accessible.</description>
	<lastBuildDate>Tue, 27 Jan 2026 21:37:37 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.0.11</generator>

<image>
	<url>https://i0.wp.com/timdettmers.com/wp-content/uploads/2025/12/cropped-profile_2026_400x400.png?fit=32%2C32&#038;ssl=1</url>
	<title>Uncategorized Archives &mdash; Tim Dettmers</title>
	<link>https://timdettmers.com/category/uncategorized/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">106749684</site>	<item>
		<title>My Journey Towards Coding Agents: Building SERA</title>
		<link>https://timdettmers.com/2026/01/27/building-open-coding-agent-sera/</link>
					<comments>https://timdettmers.com/2026/01/27/building-open-coding-agent-sera/#respond</comments>
		
		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Tue, 27 Jan 2026 12:32:00 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://timdettmers.com/?p=1248</guid>

					<description><![CDATA[<p>If you look at how people cook coding agents today, they have an industrial kitchen: large-scale reinforcement learning systems with many components for efficiency spanning hundreds of GPUs, complex repository mixing, and large teams working on all angles to optimize the data pipeline for training. For the family of Open Coding Agents we released today [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2026/01/27/building-open-coding-agent-sera/">My Journey Towards Coding Agents: Building SERA</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>If you look at how people cook coding agents today, they have an industrial kitchen: large-scale reinforcement learning systems with many components for efficiency spanning hundreds of GPUs, complex repository mixing, and large teams working on all angles to optimize the data pipeline for training. For the <a href="https://allenai.org/blog/open-coding-agents">family of Open Coding Agents</a> we released today from Ai2, we had the equivalent of a hot plate and a frying pan: 32 GPUs and five bright-eyed researchers who wanted to cook state-of-the-art coding agents.</p>



<p>This blog post is about how we cooked that coding agent.</p>



<span id="more-1248"></span>



<p>Along the way, I will share the struggles and technical details that usually do not make it into papers: switching fields, disappointing people, and feeling irrelevant. And then the failures we hit when we actually tried to cook an agent with our frying pan, and the eventual breakthrough.</p>



<p>Once we had our first breakthrough, we could upgrade our frying pan to a full set of pots and an induction oven (96 GPUs). The result? A method that lets you cheaply, with just a couple of GPU days, finetune a 32B model on your own private codebase to create a coding agent that rivals or exceeds the teacher model&#8217;s performance on this private data. And our software makes it easy to deploy it in the Claude Code environment for a smooth experience.</p>





<h2>What This Post Covers</h2>



<p>This post has four parts:</p>



<p>1. <strong>The Struggle</strong>: Switching fields, the costs of starting over, and what it felt like to be in the middle of it.</p>



<p>2. <strong>Getting a Foothold</strong>: My early work on coding agents, taking an 8 billion parameter model from 0% to 24% on SWE-bench.</p>



<p>3. <strong>Data Generation Methods</strong>: Three approaches we tried: subtask splitting, end-to-end training, and finally SERA. If you want to skip to what actually worked, jump straight to the <a href="#sera-soft-verified-efficient-repository-agents">SERA section</a>. If you want to understand why we ended up there, read on.</p>



<p>4. <strong>What This Means For You</strong>: What private repo specialization can do for you, how to deploy these models, the Claude Code integration, and what becomes possible when coding agent research is cheap.</p>



<p>&#8212;</p>



<h2>The Struggle</h2>



<h3>Switching Fields is Scary</h3>



<p>I started in a good position. I had become an expert in quantization, developing methods such as QLoRA and k-bit inference scaling laws. These were hugely successful, particularly when I integrated them into the bitsandbytes library, which reached millions of monthly downloads. QLoRA became a staple in the industry. And the findings of the k-bit inference scaling laws would eventually be implemented in Blackwell GPUs at the hardware level.</p>



<p>But research means extending knowledge, not just refining what you already know. While I was on the job market, I already knew two things that forced me to change direction: (1) quantization and other efficiency research were hitting diminishing returns and becoming less exciting; (2) it was clear to me that chatbot models would not yield the productivity improvements we need. So in August 2024, I made the leap, abandoned all previous work, and started working on coding agents, which I thought was the most promising direction.</p>



<p>It was scary.</p>



<p>Switching fields from one that you are well established in to something completely new is probably one of the hardest things you can do in research. People were excited that I was joining the Allen Institute for AI. I brought expertise with me. But after I joined, and people realized I was going to work on coding agents rather than something related to my previous expertise, they were surprised. Are you going to help me with efficient training and inference? Nope, I am going to learn about coding agents.</p>



<p>That felt selfish. I could really help my colleagues, but I chose not to, so I could focus on transitioning to coding agents and be helpful there instead.</p>



<p>That was particularly painful because I really like to help people – my main research goal remains to make the best AI available and accessible to anyone. I disappointed many people whom I liked and respected. And the costs accumulated. I had not published anything in about a year. Students in this PhD admission cycle were not excited to work with me, as they saw that I had produced nothing interesting. I watched as people paid less attention to me, as I became irrelevant. That is a particular kind of loneliness in research: knowing you are working hard on something, but having nothing to show for it.</p>



<h3>Getting a Foothold: Learning From Inspecting Trajectories</h3>



<p>To get started, I hurled myself into the depths of coding agents to build the intuition and understanding that would guide my future research.&nbsp;</p>



<p>As a proponent of open source, I naturally worked with open-source models. When I looked at SWE-bench performance for an 8B model, it was basically 0%. SWE-bench is a benchmark where you receive a GitHub issue describing a bug in a codebase and need to fix it. Closed-source models at the time performed around 30%. The 8B model had 0%.</p>



<p>My first challenge: could I improve this 8B model as much as possible and make open source competitive?</p>



<p>Being resource-limited forces you to run fewer experiments, making each one more important. My first step was to learn how to interpret trajectory data. So I downloaded trajectories from frontier models and studied them. I examined two things: the workflow models used to solve bugs, and the problems they encountered. This analysis helped me identify key patterns and strategies that I could adapt to improve open-source models.</p>



<p>Every closed-source coding model learns the same workflow, one that closely mirrors what humans do. They read the GitHub issue and try to understand where the related functions and classes are in the codebase. This leads to “rabbitholing” (as Ethan puts it), where a model goes both broad and sometimes deep into the code to find the bug and understand it thoroughly by examining how functions relate to other components. This builds the understanding that the model needs for the next steps.</p>



<p>The next steps can be interchanged: write a bug fix, then a replication script, or vice versa. Technically, writing a replication script first to verify the bug is better, but closed-source models sometimes interchange these steps. Then comes an iteration stage where editing continues until the replication no longer throws an error. Finally, you create a patch and submit it.</p>



<h3>The Copy Problem</h3>



<p>Looking at what was going wrong with small open-weight models, I found the core problem: they are very bad at copying information. This is critical for tool calling. If you want to explore a codebase, you need to reference function names exactly. These tool calling errors often confused small open-source models, where they would generate a reasoning trace of “This does not work. Let me try something else”. These small imprecisions stopped small open-weight models from getting any solutions at all.</p>



<p>The fix was straightforward: do a fuzzy search against existing functions so tool calls are made against the right function (instead of non-existing ones). This immediately led to an improvement from 0% to 7% solve rate. From there, I could run more trajectories and see other failures. The model sometimes failed to produce the closing tag required for tools, leading to no tool call and confusion that resulted in errors. Ensuring tags are always closed for tool calls pushed performance to 14%.</p>
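<p>As an illustration of the idea (not the actual scaffold code), the repair can be a closest-string lookup against the repository's known symbols; <code>resolve_symbol</code> is a hypothetical helper name:</p>

```python
import difflib

def resolve_symbol(requested, known_symbols, cutoff=0.6):
    """Map a possibly misspelled function name from a tool call
    to the closest real symbol in the codebase."""
    if requested in known_symbols:
        return requested  # exact match, nothing to repair
    # Fuzzy match: pick the single closest known symbol above the cutoff
    matches = difflib.get_close_matches(requested, known_symbols, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# The model asks for "get_confg" but the repo defines "get_config"
print(resolve_symbol("get_confg", ["get_config", "load_model", "save_state"]))  # get_config
```

<p>A step like this, applied to every symbol a tool call references, is enough to turn many near-miss references into valid calls instead of errors.</p>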



<p>The big lesson: open-weight models were not good at copying information, while closed-source models excelled at it. Armed with this knowledge, I made further changes to make smaller models more robust when referring to symbolic information.</p>



<p>I pushed the 8 billion parameter model to 24%, very close to GPT-4o at 25% for the scaffolds at that time. These quick improvements were mostly driven by my experimental methodology. I carefully estimated the variance in results so I could say, with precision, what worked and what was noise. This was something I learned from writing the QLoRA paper, where I ran thousands of small experiments to verify which variables mattered at a larger scale.</p>



<p>This was exciting. I felt I was getting a foothold in coding agents. By November 2024, people at AI2 were interested in applying these methods to their agent workflows. Nobody at AI2 was working on coding agents specifically, but people were working on scientific agents, so I shifted to building a general agent framework. Then came the health problems, and the pause.</p>



<h3>The Setback</h3>



<p>When I was starting to make progress, misfortune struck. I developed health problems that forced me to take a break. While that was a difficult decision, I stopped working in February 2025.</p>



<p>I had committed to hiring an intern, Ethan Shen, who was highly recommended with an impressive background. Even though I could not work and program myself, I wanted to mentor him. Ethan started one day before Claude Code was released. He had not worked on coding agents before.</p>



<p>But Ethan learned quickly. Remarkably quickly. Together with my mentoring, he made rapid progress. A couple of months into the coding agents project, he would reach out to other researchers about coding agents, and it turned out he often had more insights into what determines good performance than they did. He became one of the most knowledgeable people in the coding agent space. I like to think I had a role in Ethan&#8217;s growth, but really, Ethan is just exceptionally good.</p>



<p>When Ethan and I achieved our first state-of-the-art results with the subtask-splitting method, it proved that all the struggles and setbacks were worth it. This progress can inspire you to keep pushing through your own challenges in research.</p>



<h2>Data Generation Methods</h2>



<p>When I fully returned to work after my health-related break, the landscape had shifted. Qwen 3 had been released with improved reasoning capabilities. Open-source models now had the precision to make mostly correct tool calls. The bottleneck was no longer scaffolding. It was data.</p>



<p>Working with real training data from real GitHub issues was not scalable. Curating real data is difficult, particularly when you have only a single person new to coding agents working on it. So the choice was obvious: the only scalable solution was generating synthetic data.</p>



<p>At the time, most people wanted to generate correct training data: find GitHub issues with bugs, which gives you the final solution (the patch), the faulty state (the bug), and the repository. Then generate synthetic training data from a larger model&#8217;s rollout, which you use to fine-tune a smaller model.</p>



<p>For efficient synthetic data, you need a different approach. The core problem is generating labels. A simple approach: start with the label, then mutate your data to generate the example. This means: start with a correct codebase, generate a synthetic bug, then use the bug as the input and the correct codebase as the target. This makes scaling easy.</p>



<p>But here is where the resource constraint bit hard. Coding agents usually need massive amounts of compute. And we only had 32 GPUs at that time. This is the same problem I faced with QLoRA: you need resources to learn to be resourceful, but you do not have them yet. Eventually, it becomes a virtuous cycle, a flywheel. But first, you need to get it spinning by reaching some level of efficiency that makes your resources sufficient for further progress.</p>



<p>We tried three approaches. Each one taught us something that led to the next.</p>



<h3>Approach 1: Subtask Splitting</h3>



<p>Code search is one of the most important problems in coding agents. If you cannot find the bug, you cannot solve it. If you find it, you have a good chance. So the most important thing to learn from a trajectory is searching the codebase. Why not learn that first, then learn editing afterwards?</p>



<p>This has an advantage: subtask methods might transfer across problems. Searching a codebase is useful for many digital problems. Editing is used to add features, fix bugs, and more. By splitting into subtasks, we hoped to gain efficiency that spreads to many subtasks. Doing well on search first would make learning to edit more efficient, since we could already find bugs with higher precision.</p>



<p>The problem: you have no labels for subtasks. But you can use a mutation method. Start with a correct codebase, mutate it into a problem state, then use the original state as the label and the mutated state as input. This generates countless tasks for learning to search. We refer to this problem as the proxy task problem: how to find a task that indirectly helps you learn what you care about.</p>



<h4>Finding Good Proxy Tasks</h4>



<p>How do you find which task is good for generating data? There is a simple, effective method. Take multiple models, generate a proxy task via mutation, evaluate them on both the proxy task and SWE-bench, and then compute the correlation.</p>



<p>A relevant task shows a correlation close to 1: better proxy task performance means better SWE-bench performance. A bad proxy task shows low correlation.</p>



<p>The best search proxy task, which dramatically improves performance on SWE-bench and has almost a 1.0 correlation with SWE-bench performance, is the following: take a function, have a language model summarize it, then create a task where the model gets the summary and must search the codebase to find that function. This near one-to-one correlation with SWE-bench shows that search is critical, and you can easily create tasks that mirror it.</p>



<p>Fine-tuning on just 500 samples from this task makes 8 billion models almost as good as frontier models at searching codebases. This heavily boosts open-weight model performance and easily reaches state-of-the-art results (at that time). This was our first state-of-the-art result.</p>



<p>There was a thought to publish here, but I have a tendency to push further. My thought is this: a bad paper has no impact, a good paper has some impact, a great paper has a massive impact. If you have a good result, why publish? Go for the great result instead. What if you get scooped while you try to get a great result? Not too bad. You quickly publish, and you still have a good result.</p>



<p>Ethan saw it the same way and we pushed on.</p>



<h4>The Editing Step</h4>



<p>With the search solved, the second step was editing once you found the function.</p>



<p>Our method: take a correct codebase, insert a bug, describe it in a GitHub issue like SWE-bench, and provide the altered codebase with the bug plus a search trace from the first subtask. The model uses the search context pointing to the correct function and learns to edit it.</p>



<p>Around this time, John Yang and some close friends released SWE-Smith, using the same method. The difference: they used unit tests to verify that generated bugs actually break things. We did not have resources for that infrastructure.</p>



<p>Despite unverified bugs, we got good results. However, while search needs only 500 examples for near-frontier performance, editing is different. It requires precise updates, so we needed many samples with many different bugs.</p>



<p>The procedure was efficient, as we could append the search trace and reproduce the bug. But we realized that the editing model was as computationally expensive as an end-to-end approach that combined both steps. And when we ran an end-to-end experiment, we got results nearly as good as our split-task generation, with the only difference being that end-to-end is much simpler. So we decided to switch to end-to-end generation and training.</p>



<h3>Approach 2: End-to-End Generation</h3>



<p>End-to-end generation is essentially something like SWE-smith: (1) mutate a correct codebase to get a bug; (2) create a GitHub issue; (3) generate rollouts from a model; (4) train on the data (GitHub issue, mutated codebase, trajectory).</p>



<p>Generating lots of data is expensive, so we settled on not verifying or running tests. We had basically only Ethan working on the project at this point, and a full verification method would not have been feasible. So instead of hard verification of our data, we settled on <em>soft verification</em>.&nbsp;</p>



<p>Soft verification means we trained on data that could be incorrect. We only compare the generated patch with the mutated patch, looking for partial line-by-line overlap. If the model generates a patch that overlaps with the target patch by 50%, we accept it as training data. This reduced costs significantly, since running tests consumes time and CPU cycles.</p>
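<p>A minimal sketch of such an overlap check (this exact function and its line-matching rule are my own illustration, not the released pipeline):</p>

```python
def soft_verify(generated_patch, target_patch, threshold=0.5):
    """Accept a trajectory if enough of the target patch's edit lines
    also appear in the generated patch."""
    def edit_lines(patch):
        # Keep only added/removed lines, skipping the ---/+++ file headers
        return {
            line.strip() for line in patch.splitlines()
            if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
        }
    target = edit_lines(target_patch)
    if not target:
        return False  # nothing to compare against
    overlap = len(edit_lines(generated_patch) & target) / len(target)
    return overlap >= threshold

target = "--- a/f.py\n+++ b/f.py\n-    return a - b\n+    return a + b\n"
partial = "--- a/f.py\n+++ b/f.py\n+    return a + b\n"
print(soft_verify(partial, target))  # True: 1 of 2 edit lines overlap (50%)
```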



<p>We also needed a cheap model for data generation. GLM 4.5-Air and 4.6 give strong SWE-bench results and are open-weight, but the full GLM 4.6 is too big for the 32 GPUs that we needed for generation, training, and evaluation. The GLM 4.5-Air model hit the sweet spot: strong performance at low cost on a few GPUs. This gave us low verification complexity and efficient generation.</p>



<p>Cost-effectiveness spins up a flywheel: more efficient resource use leads to faster experimentation, which leads to faster progress.</p>



<p>Since we settled on no hard verification, our method was feasible with the 32 GPUs we had at that time.&nbsp;</p>



<p>But doing this revealed something unexpected.</p>



<h3>Approach 3: Soft-verified Generation – Embracing Vagueness</h3>



<p>When generating bugs, you need to give the model a hint of what the bug is without giving it away. This is not easy. It means one often has a function as a starting point and a vague description of the bug. What we did initially was to analyze the call graph of a codebase, then find functions that form a chain func A -&gt; func B -&gt; func C. Then we would prompt the model with “There is a bug downstream from A”.&nbsp; With this vague description, the model often produces all kinds of different “bug fixes”. A bug from function reuse might lead to a solution that looks more like refactoring than a bug fix. We got refactoring traces even when asking for bug fixes.</p>
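<p>For illustration, finding such func A -&gt; func B -&gt; func C chains in a toy caller-to-callees mapping could look like this (a sketch, not our actual call-graph tooling):</p>

```python
def find_chains(call_graph, length=3):
    """Enumerate simple call chains of a given length, e.g. A -> B -> C."""
    chains = []
    def walk(path):
        if len(path) == length:
            chains.append(path)
            return
        for callee in call_graph.get(path[-1], []):
            if callee not in path:  # avoid cycles
                walk(path + [callee])
    for func in call_graph:
        walk([func])
    return chains

# Toy call graph: parse calls tokenize, tokenize calls read_char
graph = {"parse": ["tokenize"], "tokenize": ["read_char"], "read_char": []}
print(find_chains(graph))  # [['parse', 'tokenize', 'read_char']]
```

<p>Each chain's head (here <code>parse</code>) then anchors a prompt like “There is a bug downstream from parse”.</p>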



<p>At first, this seemed like a problem. But at this point, Saurabh Shah joined the project. He had built reinforcement learning (RL) infrastructure and trained a lot with RL. He told us that this is actually pretty common for RL too. Training on incorrect data often gives good performance.</p>



<p>So we doubled down: instead of trying to fix this incorrect data, we embraced it, and that led to something unintuitive.</p>



<p>Instead of trying to generate training code that is as correct as possible, we do something simpler – and something slightly wacky. Take a correct codebase, select a random function, and tell the model there is a &#8220;bug&#8221; downstream from that function (even if there is no downstream function).&nbsp;</p>



<p>To prompt a model with a correct codebase and “There is a bug somewhere here. Please find it and fix it” is ridiculous – there is no bug! But when you realize that most of the generations are never about bugs anyway, this is more sensible. And the other side of this is that it is very cheap, and very easy to do. It is just prompt + function.</p>



<p>The model explores from that function, looks at the code, and imagines a bug that does not actually exist. It might uncover a missing edge case, a missing assertion, inaccurate documentation, poor code that needs to be refactored, or a &#8220;real bug&#8221;. As such, while instructed to fix a bug, the model creates many other sensible improvements to the already correct code.</p>



<p>To keep generation cheap, we continue the pipeline as usual in the second rollout: generate a GitHub issue from that first rollout and its patch, then do another rollout starting from the GitHub issue and the correct codebase to finally create a second patch. This leads us to two trajectories: one with a GitHub issue and another without.</p>



<p>This generates lots of training data from just random functions in a codebase. We compare the overlap between the two patches with soft verification. Two trajectories in succession make everything compute-efficient and simple. All you need is a codebase with functions and a model that can generate trajectories through two calls.</p>



<p>&#8212;</p>



<h2>SERA: Soft Verified Efficient Repository Agents</h2>



<p>This approach gives us a much higher data limit. Typical data generation pipelines are limited by how many sensible bugs you can create. Verifying bugs to actually cause tests to fail further limits this. You also need a starting point that does not reveal the bug location. This means many other synthetic data approaches can only generate a very limited amount of data from a single repository.</p>



<p>We have none of these limitations. From a codebase with a thousand functions, we can generate a thousand trajectories. To amplify further, we examined papers analyzing common engineering bugs, getting a distribution of 51 bug types.</p>



<p>For each function, we can generate 51 different bugs. A codebase with 1,000 functions yields 51,000 trajectories easily and cheaply. We are almost not data-limited – even for a single repository.</p>
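<p>The scaling argument above is just a cross product of functions and bug types; a toy enumeration with placeholder names:</p>

```python
from itertools import product

functions = [f"func_{i}" for i in range(1000)]    # stand-ins for a repo's functions
bug_types = [f"bug_type_{j}" for j in range(51)]  # the 51 bug types from the literature

tasks = list(product(functions, bug_types))  # one (function, bug type) pair per task
print(len(tasks))  # 51000 candidate trajectories from a single repository
```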



<p>This result leads to a core advantage of SERA: private codebase specialization.</p>



<p>Closed-source models do well on popular programming languages and common patterns. But on less-represented languages or proprietary codebases, they struggle. The holy grail is to specialize a model by training it on private data without exposing that data to a provider.</p>



<p>We generate massive amounts of data for a particular repository. Generating 7,000 trajectories takes just 19 GPU days. Training on these trajectories gives us a 32 billion model as good as the teacher, GLM 4.5-Air, the model from which we generated data. This works for other teacher models too, meaning you can compress frontier-level performance specialized to private data into a small, easily deployable model.</p>



<p>With enough data, we exceed the performance of the teacher model. This makes sense: we generate data from private repositories that the teacher has never seen. The student can exceed the teacher through specialized data.</p>



<p>This is a massive result. It helps companies quickly specialize a small model to frontier-level performance, or exceed it, on their private data.</p>



<h2>Simple to Use Infrastructure Including Claude Code Integration</h2>



<p>Ethan, Saurabh, and I knew we had something special, but we wanted to have a nice release and we were thin in terms of building that release. So Danny Tormoen joined us – and boy, did he do a good job!</p>



<p>Danny very quickly developed a converter that adapts the SWE-agent tool calls to be compatible with the Anthropic API and converts responses back to what SWE-agent understands. Danny also created a deployment script for Modal that you can run without a credit card. You just need to create an account, and you can try our agent in Claude Code for a couple of hours. All these tools are usable with 1-2 lines of code or commands. The same is true for our training and data generation pipeline.</p>



<p>With this, we could quickly start dogfooding our model in Claude Code. Initial impressions are “a pretty good model for its size,” but the model also has a habit of wanting to submit patches after some iterations – a clear artifact of the training procedure.</p>



<h2>What This Means For You</h2>



<p>With SERA you can easily and cheaply train high-performing coding agents on your personal or proprietary repositories and deploy these agents seamlessly in a Claude Code interface. The Claude Code integration is seamless – you can deploy with a single line of code.&nbsp;</p>



<p>A 32 billion model will not give you frontier performance right now. But it gives you a strong, specialized model for your private code, and you do not need to send your data to any provider.</p>



<p>More importantly, by tinkering with this, you gain skills. Once more advanced methods become available, you will be ready to use them. The trajectory of this field suggests that matching or exceeding frontier-level performance on private data is within reach. And SERA is a very exciting result in this direction.</p>



<p>I am very excited to see what you will build and do with SERA &#8212; the first release of Ai2&#8217;s Open Coding Agents family &#8212; and specialized coding agents for your data.</p>



<h2>Conclusion</h2>



<p>Because our method is cheap, it opens coding agent research to everyone. You do not need large teams to maintain complex reinforcement learning setups. You do not need thousands of dollars to replicate a baseline.</p>



<p>Our baseline, which beats the previous state-of-the-art open-source method, costs $500 to run. Any lab with a few GPUs can now do coding agent research.</p>



<p>I would not be surprised if academic labs reproduce frontier-level coding agent performance for private coding data by the end of 2026. Open-weight models are already approaching that level, and I think we can shrink everything down to 100B or maybe even 32B models that are super easy to deploy and use. With our method and future methods, you can specialize models on your private data without exposing it.</p>



<p>There is a real promise here: open-source models that exceed frontier performance because you use your private data, train your own model, and deploy it yourself.</p>



<p>This is how we cooked our coding agent with a hot plate and a frying pan. The industrial kitchen is nice, but it is not required. Sometimes constraints force you to find simpler solutions – and sometimes those solutions turn out to be better.</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2026/01/27/building-open-coding-agent-sera/">My Journey Towards Coding Agents: Building SERA</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://timdettmers.com/2026/01/27/building-open-coding-agent-sera/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1248</post-id>	</item>
		<item>
		<title>Use Agents or Be Left Behind? A Personal Guide to Automating Your Own Work</title>
		<link>https://timdettmers.com/2026/01/13/use-agents-or-be-left-behind/</link>
					<comments>https://timdettmers.com/2026/01/13/use-agents-or-be-left-behind/#respond</comments>
		
		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Tue, 13 Jan 2026 12:56:37 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://timdettmers.com/?p=1238</guid>

					<description><![CDATA[<p>If you are reading this, you probably feel the FOMO. Maybe you have seen the Twitter threads about coding agents completing entire features in minutes. Maybe a colleague mentioned they are &#8220;10x more productive&#8221; now &#8212; or “Influencers&#8221; saying AGI is here and you need to learn their particular thing now. Maybe you tried Claude [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2026/01/13/use-agents-or-be-left-behind/">Use Agents or Be Left Behind? A Personal Guide to Automating Your Own Work</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>If you are reading this, you probably feel the FOMO. Maybe you have seen the Twitter threads about coding agents completing entire features in minutes. Maybe a colleague mentioned they are &#8220;10x more productive&#8221; now &#8212; or “Influencers&#8221; saying AGI is here and you need to learn their particular thing now. Maybe you tried Claude Code and felt confused about why the magic everyone talks about is not working for you. This blog post is for those who want to cut through the hype and understand what actually works, what does not, and how to think about using agents to automate your own job further and further to be more productive.</p>



<p>I have been using agents — primarily Claude Code — for eight months to automate my own work. What you will read here is not speculation or theory. It is the product of hundreds of hours of experimentation, many failures, and some surprising successes. As a professor who does not write much code anymore, my perspective is different from the software engineering discourse that dominates Twitter. Most of my agent use is actually for writing — blog posts, grant proposals, meta reviews. While these problems might be non-traditional, they show precisely how to use coding agents for all kinds of tasks, even beyond coding itself. This helps you understand how far you can go in all the different directions of agent use.</p>



<span id="more-1238"></span>



<p>Just to give you a hint of how powerful agents have been for me: usually, the first year as a professor is very stressful and involves a lot of work. For me, it felt easy. I had some luck here and there, but I believe the use of agents is a significant reason why things have been manageable for me when they are hard for others.</p>



<p>This blog post is my attempt to share what I have learned so that it might help you grow in the long term. I will detail a number of different things I have built with agents — which succeeded and which failed. There are plenty of blog posts out there about software engineering with agents; this one tries to give a broader, more balanced view.</p>



<p>Before I worked in AI, I spent three years in the automation industry in Germany, developing SCADA systems. That experience taught me how to think about automation systematically — when it makes sense, when it does not, and how to build skills over time. I bring that perspective here, along with concepts like process optimization that are standard in manufacturing but rarely discussed in the context of AI agents.</p>





<h2>What Is Hype and What Is Real</h2>



<p>This blog post is mostly about my own experience with concrete examples of successes and failure cases. This is <em>real</em>. And I want to contrast this slightly with the Twitter discourse, which can create unnecessary FOMO.</p>



<h3>Hype: Vast parallelization, large productivity increases, and autonomy</h3>



<p>What you see on Twitter is mostly about software engineering. And while agents in software engineering are real and powerful, software engineering is very unlike most other problems.</p>



<p>Firstly, in software engineering, you often have many parallel problems that you work on independently: bugs, new features, quality control and refactoring, GitHub discussions, and reviews. All of these tasks are independent and can be parallelized. The larger the codebase, the more independent work can be done, and the more parallel sessions become useful.</p>



<p>But here is the thing: the concept of parallel sessions is an agentic workflow pattern that is useful for software engineering and some other tasks, but most tasks cannot benefit from parallelization.</p>



<p>Secondly, while productivity gains in software engineering are real, they do not automatically translate to all tasks. Coding is a very general capability. Theoretically, you can do anything digital – which spans a lot of tasks. But in practice, automation of many other non-software-engineering tasks is very difficult or has small payoffs. Later in this blog post, I will give you a framework for how to think about this more broadly.</p>



<p>Thirdly, while a fully autonomous system can be impressive, useful real work often involves design decisions. Iteratively designing a system in shorter bursts of autonomy, with feedback loops that shape the final solution you want, can be much more effective than just rolling out agents until a solution is achieved. Full autonomy works, but it is not very helpful for most work because the quality is too low.</p>



<h3>Real: Agents Should Be Used Everywhere</h3>



<p>This blog post is about using coding agents for all kinds of tasks and how I learned from that experience. After 8 months of Claude Code use and trying to automate countless tasks, here is my honest assessment: more than 90% of code and text should be written by agents. You need to do so, or you will be left behind.</p>



<p>This statement might seem controversial and as FOMO-inducing as the software engineering story I just critiqued. But I believe it is reality, and understanding how to adjust to that reality will be a big part of everyone&#8217;s jobs going forward. This blog post is an attempt to share my knowledge to help you on this path.</p>



<p>When I talk with people about this, a lot of people push back vehemently. Generate all your work with AI? They find it ridiculous. How can generic boilerplate generation replace the intricate style of a well-designed software system? It feels absurd for them to replace the immediately noticeable and distinct style of a writer with an AI-generated slop wall of text.</p>



<h2>Why AI-Generated Content Is Personal, Not Generic</h2>



<p>AI-generated content is personal content. When I explore a concept with Claude, the output is not generic. It is shaped entirely by my thinking, my style of conversation.</p>



<p>I really like connections between fields, and the topics I explore are highly personal and unique traces of my thinking — and with that, my taste.</p>



<p>Let me give you a vivid example. I once started a conversation about jihad — a concept in Islamic theology that is often misunderstood: the inner struggle to do the right thing when it is difficult but really matters. From there, I ended up connecting it to Krishna&#8217;s advice to Arjuna to do the right thing and not worry about the outcome by surrendering the fruits of action to him (karma yoga), to Lutheran grace that is purest when it emerges at the height of struggle, to Taoist Wu Wei, where struggle disappears through letting go and letting your nature take over, to Beowulfian naked will against overwhelming odds — the struggle with Grendel as a symbol of surrendering to your fate.</p>



<p>None of this exists in any textbook or on the internet. It is a fingerprint. A very personal fingerprint. If you were to read the details of these conversations, you would know parts of me intimately — who I am and why I am that way. You would know me to a degree that is usually only reserved for close friends and your partner.</p>



<p>Someone who thinks AI-generated content is impersonal and generic is deeply mistaken. The concept of soulless AI generation is an artifact of less powerful AI, or the mistake of seeing your own generations as the limit of what AI can do, rather than recognizing it as a skill issue that can be overcome.</p>



<h2>Useful Background: How to Think About Automation</h2>



<h3>The Basic Calculus of Automation</h3>



<p>Before I worked in AI, I worked in the automation industry in Germany. I was developing SCADA systems — integrating data from machines and databases to enable the control of workflows via data and monitoring. The knowledge gained in these three years in the automation industry applies directly to automating your own work with agents.</p>



<p>The first important question is: when should you automate versus when should you not? While people always think about automation in terms of full automation, this is almost never the case in practice. You always have a degree of automation.</p>



<p>Here is how to think about useful automation: if you take your current degree of automation and increase it by a new technology, then you improve by a certain percentage. If a task takes 10 hours and you improve the degree of automation by 10%, then it takes 9 hours.</p>



<p>With this view, there is a simple calculus: how often do you do the task, and how long would you need to work to improve its degree of automation by 10%? If this calculus shows that the cost of automating something is higher than the gain, then the problem is not fit for automation and the task should be done manually. There are many tasks that should not be automated because it is not effective. Later in this blog post, I will give a lengthy example about email automation, where I tried hard, but it is a problem where automation fails.</p>



<p>Additionally, changing your workflow adds overhead. For example, you might save 30 seconds, but if your agent needs 30 seconds to generate the result, then the net effectiveness is 0%: the degree of automation improves by 0%.</p>



<p>If you invest so much time into improving your work with agents, you want to make sure that it actually helps. This simple calculus – while not perfect – is a simple tool to help you decide where to start with automating your work.</p>
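<p>This break-even calculus is easy to make concrete. Here is a minimal sketch; the numbers and function names are illustrative, not from any real tool:</p>

```python
def hours_saved_per_year(task_hours, runs_per_year, automation_gain):
    """Yearly hours saved by raising the degree of automation of one task."""
    return task_hours * automation_gain * runs_per_year

def payback_years(build_hours, task_hours, runs_per_year, automation_gain):
    """Years until the one-time automation effort pays for itself."""
    saved = hours_saved_per_year(task_hours, runs_per_year, automation_gain)
    return float("inf") if saved == 0 else build_hours / saved

# A 10-hour task done monthly, automated to be 10% faster:
print(hours_saved_per_year(10, 12, 0.10))         # 12.0 hours per year
# If building that automation takes 20 hours, it pays back in ~1.7 years:
print(round(payback_years(20, 10, 12, 0.10), 2))  # 1.67
```

<p>If the payback period is longer than the lifetime of the task, the short-term calculus says: do it manually.</p>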



<h3>The Method: Process Optimization</h3>



<p>A very basic method in factory automation is process optimization. You are on the factory floor with a stopwatch. You look at and study what workers are doing. You time each step: how they hand off work to another person, when they wait for previous work to complete, how many resources they need, and whether they wait for resources.</p>



<p>If you have all these components, you can construct a workflow — a process that reflects the current state of how work is done. Based on this, you can think about how to optimize. This thinking is extremely important for automating your own work. You have to think about how you work on a particular problem and how you can change your work with agents. Using this framework of process optimization can be extremely helpful to get a quick sense of how much productivity gains you can achieve with a particular solution. Sometimes you find that the process cannot be optimized much with agents – that saves you a lot of time.</p>



<p>Let me give you a concrete example. If I take one minute to read an email and 30 seconds to reply, then it takes 1 minute 30 seconds to complete an email. Now, if I use an agent to help me with my emails, I need to guide it through processing them, read its output to decide how it should draft or answer emails, and then edit those drafts. Once you do this exercise, you realize that agents just shift the process: some generation happens automatically, but you still need to read the content, make sure it is aligned with your intent, and edit the draft if it does not match what you wanted.</p>



<p>There are certain emails that are easy to automate. There are others that are not. It depends on your process and your underlying inputs whether using agents and changing your process can actually lead to productivity gains.</p>



<p>Reading an email or reading AI-generated content has a cost. You need to include that cost in your process optimization thinking to understand if your process can benefit from automation. This insight — which seems obvious but is often ignored — is fundamental to achieving higher and higher degrees of automation.</p>



<h3>The Long-Term Perspective: Building Automation Muscles</h3>



<p>The process perspective I just gave is a short-term view. You look at the underlying processes and the degree of automation, then think about how long you will need to automate that work and how much you increase the degree of automation. It is a simple calculus. This is classic automation as practiced in Germany and other countries: very cost-effective and optimal in the short term.</p>



<p>However, it is very short-sighted because it does not consider long-term consequences.</p>



<p>The long-term view is a Shenzhen-style perspective. It is not about making any automation useful in the short term. It is about making automation useful in the long run by gathering knowledge that improves automation over time.</p>



<p>It is essentially the short-term calculus with a meta-automation step added: even if the degree of automation is not worth it in the short term, will the skills I build and the tools I develop make previously ineffective automation effective in the future? Does the additional knowledge help me with future automation effectiveness?</p>



<p>This is exactly what led from Shenzhen-style scrappy factories to highly structured dark factories that are fully automated. Chinese automation is far superior to Western automation, not because of scale, but because the long-term view of automation led to a higher quality and degree of automation.</p>



<p>This is an important concept. You need to optimize both short-term and long-term perspectives to effectively automate your own job. Europe is struggling because of its short-term view of automation. The US is struggling in many segments because it did not build the long-term skillset that is required to build the automation muscles to tackle more challenging problems.</p>



<p>In other words, using agents and failing at automating a task effectively is important. You need to gain skills to improve future automation, and that means sometimes trying to automate things that you know you will not be able to automate.</p>



<p>Making sure you learn over time is key. Often, you learn more from failures than from successes, and with agents it is no different.</p>



<h3>Why Automating Your Job Is Good for You</h3>



<p>Software engineers are not replaceable. They just level up. The current hiring challenges are driven by COVID and financial dynamics much more than by AI. Software engineers are now much more effective at building software more rapidly, and the value of software has not decreased significantly. An engineer who uses agents like a hot knife slicing through butter is actually more valuable because they can produce more software that still has significant value.</p>



<p>A common view, particularly from the Bay Area, is that this is the current state, but software engineering will be fully automated very soon. I have many friends at frontier labs who held this view about nine months ago, but it has broadly changed. They see that it is difficult to automate their own work and that, as they use these tools, new problems open up.</p>



<p>Even if an agent can do everything, it cannot do everything at the same time. If you have a limited amount of GPUs, you want to direct agents to tasks that are useful for you so they can generate value where you need it. While even that can be partially automated, once your existence is at stake, you probably want to direct what agents do yourself — at least specify the problem and solution you want.</p>



<p>I think it will be a long time until you use an agent to manage your retirement savings by analyzing the stock market and optimizing it fully autonomously. But what is more reasonable is that you build a system where you tell an agent what risk you are happy to accept and how to optimize this risk through hedging, so that you might manage your retirement fund with a trade-off between potential upside and risk over time. It would be unwise to fully trust an agent if you do not know the parameters that are important for you and how the agent chooses those parameters.</p>



<p>If resources are limited, you want to decide how those resources are used rather than fully trusting an agent. And if this is true, then directing agents will remain a problem even if agents can do everything, because resources are finite and agents cannot do everything at once.</p>



<p>Long story short, because of this resource problem, there will always be work where your personal preferences, decisions, and taste are needed — even if 90% of the work happens through AI. From software engineering, we already see that these changes work, but that they will not eliminate many jobs we thought would be automated away quickly.</p>



<p>I think the other direction is actually more pressing: if you do not know how to use agents, you will not have a good job or be able to find a job. Agent use is now an essential skill that you need to develop and master.</p>



<h2>My Personal Experience with Automating My Own Work</h2>



<h3>Personal tools and pipelines</h3>



<p>What is most common on Twitter are examples of successful agent use, where people create a tool that is useful for them. Small extensions that help your everyday life — just vibe coding something that you always wanted and that is simple, but nobody provided.</p>



<p>While this is a simplistic way of using agents, it has its place. These are problems where agents work really well and require very little skill to use correctly.</p>



<p>For example, I built tools that help me write this blog post. I built tools that help me work with agents. One of the most important tools is a voice tool, which helps me quickly interact with agents, particularly for parallel sessions. A voice tool also helps me because I have carpal tunnel in both my hands. Typing can be painful. I have a very custom keyboard layout and way of working with the keyboard that reduces pain to almost zero, but still, it is much more comfortable to just use my voice. And it is not only comfortable, it is also faster.</p>



<p>A main advantage is that with voice, you can inspect outputs and use your keyboard and mouse while narrating. This is extremely powerful. A key tool that everybody should develop is their own voice tool to use AI in this way, where they can do work while narrating.</p>



<h3>Tools for Students</h3>



<h4>Finding related papers: Replication of Connected Paper</h4>



<p>Another tool I built was to solve the problem of finding related work. The most useful tool I have ever used for this was <a href="https://www.connectedpapers.com/">Connected Papers</a>. It was free for a while, but then it became commercial. I need something like this at the beginning of a project and when writing the related work section of a paper. I did not want to pay for the subscription, and I wanted my students to be able to access it. So I just replicated the entire software system.</p>



<p>This was probably not effective for automation in the short term — I could have just paid for a Connected Papers subscription. But it gave me an overview of what I can do in the long term: what tools can I build, what is too ambitious, what is less ambitious, and how can I be more effective when creating complex tools.</p>



<p>My Connected Papers replication uses the Semantic Scholar API to retrieve data. Then it builds statistics on the citation graph of papers to find papers that are very similar to what Connected Papers finds. The key insight I had is how Connected Papers works: it finds papers that are in conversation. These papers are often indirectly related through a third paper that cites both of them, which creates a loop of three papers. If you count how often two papers appear together in such three-paper loops across the citation graph, you have a very good way of finding related papers.</p>
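<p>The core of that relatedness measure can be sketched in a few lines. This is a toy version under my own assumptions about the mechanism, with a small in-memory graph standing in for data you would fetch from the Semantic Scholar API; names are illustrative:</p>

```python
from collections import defaultdict

def co_citation_scores(cites, seed):
    """Rank papers by how often a third paper cites both them and the seed.

    `cites` maps each paper to the set of papers it cites.
    """
    scores = defaultdict(int)
    for third, refs in cites.items():
        if seed in refs:
            for other in refs:
                if other != seed:
                    scores[other] += 1  # `third` closes a three-paper loop
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy citation graph: C and D both cite A and B; E cites A and X.
graph = {
    "C": {"A", "B"},
    "D": {"A", "B"},
    "E": {"A", "X"},
}
print(co_citation_scores(graph, "A"))  # [('B', 2), ('X', 1)]
```

<p>The full replication builds more statistics on the citation graph, but counting how often a third paper closes the loop is the core of the &#8220;papers in conversation&#8221; signal.</p>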



<p>The tool that I created was very useful, but here is where it failed: the user interface. The algorithm works well; making the software easy to use for others turned out to be the hard part. My students needed to execute a Python command and use a password to extract an API key – it was a mess to get started. Instead of local deployment, I should have built a deployment that is just a regular website you can access anywhere with your browser.</p>



<p>If you want to create a tool that improves the degree of automation of a task, a useful tool is often not enough. You need to figure out how you and other people can use these tools intuitively and effectively.</p>



<p>You see that even creating a simple tool like this Connected Papers replication can have its own complexity. But such failure cases give you perspective: while not highly successful in the short term, they give you the skills needed to tackle problems more effectively in the long term.</p>



<p>I would encourage you to spend some time on projects that do not offer a high gain in terms of degree of automation, just for the sake of having more diverse points of failure that you encounter, which will inform your future automations.</p>



<h4>Exploiting coding agents as an API</h4>



<p>Other tools I built are mostly for my students. It was recently revealed that quite a bit of Claude Code use actually comes from exploiting it as an API for regular LLM calls — using a Claude Code endpoint simply as an API in other work. While I do not use Anthropic for this, there are other providers where you can get frontier capabilities at about 1% of the usual API costs. So you get regular API calls, but at 1% of the price.</p>



<p>I built this pipeline for my students a couple of months ago, and when I asked my students if they needed any GPUs for their projects, they said no — they just generate evaluations and research directions with the API that I created. It has been a very useful tool. For a research group, having easy access to frontier model capabilities without the cost that is typical for APIs is liberating for research. And I built this tool in about 2 hours. This is where good tooling really paid off.</p>



<h4>Other Tooling for Students</h4>



<p>Other tooling that is much more straightforward includes infrastructure for Slurm and a common infrastructure to analyze experiments. I believe Weights and Biases actually harms research by biasing the interpretation of results and how experiments are run, and so the custom tools that I have in the pipeline will help my students to avoid this bias.</p>



<p>A tool I have not developed yet, but which my colleagues have mentioned, is a review system where students can get feedback on ideas or papers by querying an agent or an agentic pipeline that mimics how their academic advisor would give feedback. Imagine a student being able to get a first-pass critique of their paper at any time without being embarrassed about it or worrying about perceptions. This would not replace advising, but it would make our collaborations more productive by handling the basic structural and clarity feedback automatically.</p>



<p>While not all of these tools might be useful, and some are more like distractions, it is clear that with the right pipelines, workflow, and tools, productivity for students can be increased dramatically — and this can be driven by an advisor who invests in building these systems.</p>



<p>Similarly, a technical manager can develop tools and guide a team in this way. Even if agents cannot do all the work, you need to figure out what work you actually want to do and how you want to build on each other&#8217;s work as a team. Agents can work independently, but it might not be useful if your team is pulling on different ends of a problem. If coordination is missing, and everyone is using agents in their own way, it can lead to disaster. The tools an advisor or manager builds can provide that coordination layer.</p>



<p>All of these examples highlight where tools fail, where tools can be useful, and where tools might not be useful in the short term but give you the skills to improve tools in the future.</p>



<h2>Writing Tasks</h2>



<h3>Blog Posts</h3>



<p>You might have guessed it already: this blog post is AI-generated. My previous post about <a href="https://timdettmers.com/2025/12/10/why-agi-will-not-happen/">Why AGI Will Never Happen</a> was AI-generated too. More than 95% of the text of both blog posts comes from an AI model. I did not even type prompts. Most of it was just me rambling into a microphone while doing other things, then transcribing that voice into text, shaping it into a blog post, doing a style transfer to shape it into my voice, and then adding some small snippets that have character.</p>



<p>The editing and the added snippets, this last 5%, are a cherry on top that is very important. But the key point stands: 95% is AI-generated, yet I bet you still find this useful and enjoy the read. It has my style and my voice of writing and presenting information. Processing information in this way, to really make writing personal, is not that difficult if you use AI agents well.</p>



<p>While I am still experimenting with blog posts, this pipeline allows me to write blog posts much more quickly — and blog posts that are much more current. A blog post like this would have taken me days in the past. Now it takes about 3 hours: one hour to speak content into a microphone, 10 minutes for my agentic workflow, and then ~2 hours of reviewing and editing. It is very fast, and when using the agents, you notice that the quality is pretty good.</p>



<p>Would you agree that this blog post has soul? Or is it AI slop now that you know it is AI-generated?</p>



<h3>Writing Tasks: Grant Proposals</h3>



<p>Grant proposals are a major time sink as an academic. A CMU student costs $150,000, and I need to find that money by writing grant proposals. A lot of proposals are rejected, so you have to write lots of them.</p>



<p>It is interesting because while you might think the blog post approach should work, it actually does not work that well. Grant proposals need to have a particular structure, and even small deviations read poorly. Good design is familiar design, and good proposals are familiar proposals.</p>



<p>This is just like a good abstract — for example, an abstract in Nature almost always has the same structure, sentence by sentence, the same for every paper. That makes abstracts easy to read because you know where to find information.</p>



<p>I am dyslexic, and reading is very slow for me, but I learned to read papers at a relatively okay pace because I understand that they have a common structure that repeats again and again. I can skip sections, skip to particular phrases, and I know where an interesting part begins. If the introduction says &#8220;In this paper&#8221; or &#8220;Here,&#8221; then I know the list of contributions starts.</p>



<p>Grant proposals are highly structured. A free-flowing, talkative approach that I use for blog posts does not work out of the box, but it can be made to work by introducing an abstraction pattern.</p>



<p>This abstraction pattern works as follows: you create sentence-by-sentence extractions of what the grant proposal content should be. For example, for an abstract:</p>



<ul><li>The first sentence is about the general field and the subfield</li><li>The second sentence mentions the problem, why it is important, and why it has not been solved</li><li>The third sentence states your contributions and your main results</li><li>The fourth sentence explains your main method</li><li>Then, depending on taste, you expand on this method or keep it brief</li><li>Finally, you state the impact and broad implications</li></ul>



<p>If you have an AI model, you can apply this process very easily. Just take a couple of grant proposals, your own or others, that you really liked. Then use an agent to extract this structure, sentence by sentence, and merge multiple abstracted structures by their commonalities.</p>



<p>Then I use this structure together with an agent to create an interactive flow: the agent asks me specific questions, and I respond with a voice message about the content I want – often casual &#8220;rambling&#8221; about the research I want to do. After each response, the agent stores the content and checks the abstract template for missing key information. The agent asks follow-up questions, and I answer them with my voice tool.</p>
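<p>The template-checking step of this flow can be sketched as follows. The slot names and keyword matching here are a hypothetical simplification; in practice the agent itself judges whether a slot is covered:</p>

```python
# Illustrative slot names and keywords, not the actual prompts I use.
ABSTRACT_SLOTS = {
    "field_and_subfield": ["field", "area", "learning"],
    "problem_and_why_unsolved": ["problem", "challenge", "limitation"],
    "contributions_and_results": ["we show", "we introduce", "results"],
    "main_method": ["method", "approach", "algorithm"],
    "impact": ["impact", "enables", "implications"],
}

def missing_slots(responses, slots=ABSTRACT_SLOTS):
    """Return template slots that no transcribed voice response covers yet."""
    text = " ".join(responses).lower()
    return [name for name, keywords in slots.items()
            if not any(k in text for k in keywords)]

notes = ["The problem of slow training is a key limitation in the field.",
         "We introduce a new method with strong results."]
print(missing_slots(notes))  # ['impact']
```

<p>The missing slots become the agent&#8217;s follow-up questions for the next round of voice responses.</p>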



<p>I then have the agent generate the draft and then smooth it over by doing style transfer using particular proposals that I have written and like.</p>



<p>With this, I can create a four-page grant proposal in about an hour and a half — even faster than a blog post.</p>



<h3>Meta Reviews</h3>



<p>Machine learning conferences are notorious for bad reviewing. The reviewing system is broken. There have been studies on ICLR and NeurIPS with clear results: reviewing does not work. Reviewing can identify the very worst papers and the best papers, but in between it is a coin flip.</p>



<p>The finding from these studies is that reviewing quality is not related to knowledge but to effort. Undergrad students write much higher-quality reviews than PhD students or professors because they have more time and take the task more seriously. For PhD students and professors, it is a chore.</p>



<p>Looking at that reality, using agents becomes very straightforward, and I would argue an imperative to improve review quality by reducing the time needed for reviewing.</p>



<p>In this case, we look at meta reviewing, reviewing the reviews, which is the task of an area chair. There are two philosophies about being an area chair. One is that you bring your own opinions and overrule decisions. The other is to follow what the reviewers said. I believe the second is more intellectually honest. While I have expertise and will sometimes overrule reviewers, I have not had the depth to read every paper thoroughly, and certain concerns might be valid. A good paper is not a paper that I like, but a paper that is useful for the research community, and usefulness is difficult to judge if you do not read a paper in depth.</p>



<p>What I built to help with meta reviewing is a system to analyze the discussions and the points where reviewers disagree, give summaries of papers, summarize which papers are borderline, and identify which are clear rejects or accepts. The clear accepts or rejects have low score variability. These reviews can be processed quickly — you can understand why people have certain views.</p>



<p>The workflow is as follows: an agent uses my OpenReview login details to log in, navigate to the papers, get all the reviews and rebuttals, and store them to disk. Then the interactive part with the agent starts, which helps me understand where the issues are.</p>



<p>What is more subtle is tracking changes in the discussion. With the rebuttal, even if the score is not increased (which is very common because people do not have time), the rebuttal might contain information that could change the outcome.</p>



<p>From all this discussion about borderline cases, you can easily draft the first meta review. If it looks strange, you can ask the agent to explain, provide more detail, or provide evidence. It is a very interactive way of reviewing and actually mirrors what I would do without AI agents: separate straightforward and difficult cases; analyze difficult cases for disagreement; figure out which arguments have merit and if author rebuttals change the picture; draft a review; edit by looking at details; submit.</p>



<p>All these things can be done by an agent, and they can be done faster and probably more precisely. Understanding a subtle argument of a paper I have not seen before, between reviewers with different perspectives — this is hard if it is 5 PM and I have already had eight meetings, and I am just tired. But if I do it with my voice tool and my meta review agent system, this allows me to write high-quality meta reviews and make decisions that consider all information and arguments of reviewers and authors carefully.</p>



<p>The use of agents for meta reviews might be highly controversial, but again, AI-generated content is highly personal if you do it right. This also goes for reviews. I think we do a disservice to the research community if we do not use agents for reviewing, since they can improve quality dramatically.</p>



<h2>Where Agents Fail: A Study of Email Automation</h2>



<p>I alluded previously that I tried to automate emails for a long time. Over two months, I worked quite a bit on automating emails — for one, because I do not like emails, and also because it is now a major part of my work.</p>



<p>I wanted to build a system that helps me manage, prioritize, and draft emails. For most people, the process of &#8220;doing your emails&#8221; is probably similar: categorize emails by urgency, map out the information needed for a reply, and prioritize replies given the time you have until your next meeting or other event.</p>



<p>Doing this manually is very simple and fast. I can often look at the title and immediately sort it into a category. I can skim an email for 10 seconds and know if I need to reply now or if it has time. I can organize emails in bulk to review later.</p>



<p>My initial attempt at email automation was very focused on features. Can I do this categorization, prioritization, bulk sorting, and get the gist of an email with agents?</p>



<p>But here is the issue: even if you automate all of this, you still have a similar workflow. If you categorize an email automatically, you still need to look at the categorization to see if there are new emails. If you have an AI summary of an email, you still need to read it. If you create agent-generated drafts, you need to look at each draft and check that it has the right details, the right tone, and actually says what you wanted to say.</p>



<p>Furthermore, Gmail is a familiar interface. You know where everything is, and all of this prioritization, categorization, etcetera, can be done easily and quickly. If an AI does that, many things are automated, but you still need to use a user interface. This interface may be unfamiliar, not optimized for all workflows, or might miss crucial information or features. And navigating and using an agent-driven email system costs time, just like doing it manually costs time.</p>



<p>Here, the process optimization view kicks in. If I can categorize an email within five seconds, that is pretty fast. An AI agent needs to beat those five seconds and be more precise than I am for it to actually be useful. While the reading and categorization can happen in the background, with an AI-generated draft, I still need to navigate to that draft and read it. That might take 10 to 30 seconds just for navigation and reading, plus an additional minute for editing the draft. In many cases, the manual approach is about equally fast. But if you add the development time for this system (it was more than 100 hours), using the agentic system becomes clearly net negative in terms of productivity.</p>



<p>Despite all these edge cases, I did not want to give up. For one, I really do not like emails. But the second part is that, for me, it was a challenge: can I automate this task? And if I cannot, it would serve as a hard-won lesson for future automation challenges.</p>



<p>So I made a second attempt. I knew about the process. I knew about the importance of interfaces and how to structure information. Since I am an avid Vim user, I wanted to build a vim-optimized interface. This was a long process — co-designing functionality, agents, and the user interface. My productivity using the agentic email system improved day by day, but at some point I saw the improvement plateauing, and I asked: Is Gmail, if I use it the right way, faster?&nbsp;</p>



<p>So I compared time spent on emails between the tool I created and just using Gmail – which is very much the process optimization view of having a stopwatch on the factory floor. What I found is that just using Gmail is faster. I could not get any degree of automation improvement by using agents for emails.</p>



<p>That was a very important lesson. Sometimes you fail, and that failure teaches you something valuable for the next challenge.</p>



<h2>Conclusion</h2>



<p>If you take away one thing from this blog post, let it be this: agent use is a skill, and like any skill, it requires deliberate practice, understanding of when it applies, and acceptance that you will fail often before you succeed.</p>



<p>The hype is real in some domains and misleading in others. Software engineering parallelization is real but not generalizable. The personal nature of AI-generated content is real and profound. The need for process thinking before automation is real and often ignored.</p>



<p>I hope these perspectives have been useful to help you think about how you can use agents, where agents work well, and what is hype and what is not. The key is to think carefully, experiment often, and build skills for the long term. I hope this blog post will help you make agents your own and see more and more benefits from agent use.</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2026/01/13/use-agents-or-be-left-behind/">Use Agents or Be Left Behind? A Personal Guide to Automating Your Own Work</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://timdettmers.com/2026/01/13/use-agents-or-be-left-behind/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1238</post-id>	</item>
		<item>
		<title>Why AGI Will Not Happen</title>
		<link>https://timdettmers.com/2025/12/10/why-agi-will-not-happen/</link>
					<comments>https://timdettmers.com/2025/12/10/why-agi-will-not-happen/#comments</comments>
		
		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Wed, 10 Dec 2025 15:05:30 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://timdettmers.com/?p=1233</guid>

					<description><![CDATA[<p>If you are reading this, you probably have strong opinions about AGI, superintelligence, and the future of AI. Maybe you believe we are on the cusp of a transformative breakthrough. Maybe you are skeptical. This blog post is for those who want to think more carefully about these claims and examine them from a perspective [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2025/12/10/why-agi-will-not-happen/">Why AGI Will Not Happen</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>If you are reading this, you probably have strong opinions about AGI, superintelligence, and the future of AI. Maybe you believe we are on the cusp of a transformative breakthrough. Maybe you are skeptical. This blog post is for those who want to think more carefully about these claims and examine them from a perspective that is often missing in the current discourse: the physical reality of computation.</p>



<span id="more-1233"></span>



<p>I have been thinking about this topic for a while now, and what prompted me to finally write this down was a combination of things: a Twitter thread, conversations with friends, and a growing awareness that the thinking around AGI and superintelligence is not just optimistic, but fundamentally flawed. The purpose of this blog post is to address what I see as very sloppy thinking, thinking that is created in an echo chamber, particularly in the Bay Area, where the same ideas amplify themselves without critical awareness. This amplification of bad ideas and thinking exuded by the rationalist and EA movements is a big problem in shaping a beneficial future for everyone. Realistic thinking can ground where we are and where we have to go to shape a future that is good for everyone.</p>



<p>I want to talk about hardware improvements, AGI, superintelligence, scaling laws, the AI bubble, and related topics. But before we dive into these specific areas, I need to establish a foundation that is often overlooked in these discussions. Let me start with the most fundamental principle.</p>





<h2>Computation is Physical</h2>



<p>A key problem with ideas, particularly those coming from the Bay Area, is that they often live entirely in the idea space. Most people who think about AGI, superintelligence, scaling laws, and hardware improvements treat these concepts as abstract ideas that can be discussed like philosophical thought experiments. In fact, a lot of the thinking about superintelligence and AGI comes from Oxford-style philosophy. Oxford, the birthplace of effective altruism, mixed with the rationality culture from the Bay Area, gave rise to a strong distortion of how to clearly think about certain ideas. All of this sits on one fundamental misunderstanding of AI and scaling: computation is physical.</p>



<p>For effective computation, you need to balance two things. You need to move global information to a local neighborhood, and you need to pool multiple pieces of local information to transform old information into new. While the complexity of local computation is virtually constant — much accelerated by smaller transistors — movement scales quadratically with distance to local computation units. While memory movement also benefits from smaller transistors, improvements become quickly sublinear due to the squared nature of memory access patterns.</p>



<p>This is most easily seen by looking at cache hierarchies. L1, L2 and L3 cache are physically the same technology, but computationally they are very different. L2 and L3 are much larger than L1, but they are also much slower. This is because L2 and L3 are further away, physically, from the computational core, and memory lookups need to traverse a longer distance due to the physical size.&nbsp;</p>
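The latency gap between cache levels described above is easy to probe even from a high-level language. Below is a minimal pointer-chasing sketch in Python; the working-set sizes are rough stand-ins for L1-, L3-, and DRAM-sized footprints, all timings are machine-dependent, and CPython interpreter overhead mutes the differences, but the trend that larger working sets are slower per access should still show:

```python
import random
import time

def chase(num_elems, steps=100_000):
    """Time dependent loads through a random permutation of num_elems slots.

    Each access depends on the previous one, so the hardware cannot
    prefetch; when the working set fits in L1 the hops are fast, and as it
    grows past L2/L3 each hop pays a longer physical round trip to memory.
    """
    perm = list(range(num_elems))
    random.shuffle(perm)
    idx = 0
    start = time.perf_counter()
    for _ in range(steps):
        idx = perm[idx]  # serial chain of loads, no prefetching possible
    return (time.perf_counter() - start) / steps * 1e9  # ns per access

# Roughly cache-sized vs. DRAM-sized working sets at ~8 bytes per slot
# (CPython stores pointers, so real footprints are larger).
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} elements: {chase(n):.1f} ns per access")
```

On a typical machine the per-access cost grows noticeably between the smallest and largest working set, which is the physical point of the paragraph: the same load instruction gets slower purely because the data is farther away.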



<p>Two ideas to remember: First, larger caches are slower. Second, as we get smaller and smaller transistors, computation gets cheaper, but memory becomes more expensive, relatively speaking. The fraction of silicon area dedicated to memory on a chip has increased over time to the point where now computational elements on a chip are trivial in proportion. Almost all area is allocated to memory. In other words, if you want to produce 10 exaflops on a chip, you can do that easily &#8212; but you will not be able to service it with memory, making those FLOPS useless (the NVIDIA marketing department is good at ignoring this fact). All of this makes AI architectures like the transformer fundamentally physical. Our architectures are not abstract ideas that can be developed and thrown around carelessly. They are physical optimizations of information processing units.</p>



<p>To process information usefully, you need to do two things: compute local associations (MLP) and pool more distant associations to the local neighborhood (attention). This is because local information alone only helps you to distinguish closely related information, while pooling distant information helps you to form more complex associations that contrast or augment local details. The transformer is one of the most physically efficient architectures because it combines the simplest ways of doing this local computation and global pooling of information. The global pooling of information might be made more effective through research, and there is still active investigation going on that I think might be promising, but it has diminishing returns &#8212; the transformer architecture is close to physically optimal.</p>



<p>Computation is physical. This is also true for biological systems. The computational capacity of all animals is limited by the possible caloric intake in their ecological niche. If you have the average calorie intake of a primate, you can calculate within 99% accuracy how many neurons that primate has. Humans invented cooking, which increased the physically possible caloric intake substantially through predigestion. But we reached the physical limits of intelligence. When women are pregnant, they need to feed two brains, which is so expensive that physically, the gut cannot mobilize enough macronutrients to keep both alive if our brains were bigger. With bigger brains, we would not be able to have children &#8212; not because of the birth canal being too small, but because we would not be able to provide enough energy &#8212; making our current intelligence a physical boundary that we cannot cross due to energy limitations.</p>



<p>We are close to reaching the same limits for digital computation.</p>



<h2>Linear Progress Needs Exponential Resources</h2>



<p>There have been studies about progress in all kinds of fields that come to the same conclusion: linear progress needs exponential resources. What does that mean? If you want to improve a system further and further, make it more precise, or improve its efficiency, you need exponentially more resources with any improvement that you make. This is true for all kinds of fields and problems being investigated, and it is pretty clear why.</p>



<p>There are two realities at play here: one physical and one in the idea space. In the physical reality, if you need to accumulate resources in time and space to produce an outcome, then for logistical reasons, the overall effect that is locally produced needs linear resources to produce a linear outcome. But because of physicality and because matter takes up space, those resources can only be pooled at an increasingly slowing rate due to contention in space or time.</p>



<p>In the idea space, there is a similar phenomenon, which is less obvious. If two ideas are completely independent, they can have an effect that is ten times larger than any single idea. But if ideas are related, then the overall impact is limited due to diminishing returns &#8212; the ideas are just too correlated. If an idea builds on another, it can only be so much better. Often, if there is a dependency between ideas, one is a refinement of the other. Refinements, even if they are extremely creative, will yield incremental improvements. If a field is large enough, even if one tries to work on very different ideas, they are still heavily related to previous ideas. For example, while state-space models and Transformers seem like very different approaches to attention, they concentrate on the same problem. Only minimal gains can be achieved through any idea that modifies attention in these ways.</p>



<p>These relationships are most striking in physics. There was a time when progress could be made by individuals – not so much anymore.</p>



<p>I talked to a top theoretical physicist at a top research university, and he told me that all theoretical work in physics is, in some sense, either incremental refinement or made-up problems. The core problem of the idea space is this: if the idea is in the same sub-area, no meaningful innovation is possible because most things have already been thought. A first urge is to look for wildly creative ideas, but the problem is that they are still bound by the rules of that subspace, rules that often exist for a very good reason (see the graduate-student-theory-of-everything phenomenon). So the theoretical physicist faces only two meaningful choices: refine other ideas incrementally, which leads to insignificant impact; or work on rule-breaking unconventional ideas that are interesting but will have no clear impact on physical theory.</p>



<p>Experimental physics demonstrates the physical limitations. The experiments that test more and more fundamental laws of physics and constituent particles — in other words, the standard model — become increasingly expensive. The standard model is incomplete, and we do not know how to fix it. Higher energies at the Large Hadron Collider have only led to more inconclusive results and the ruling out of more theories. We have no understanding of what dark energy or dark matter is, even though we build increasingly complex experiments that cost billions of dollars. The reality might be that certain aspects of physics are unknowable, hidden by complexity that cannot be attained with the resources that we can muster.</p>



<p>If you want to get linear improvements, you need exponential resources.</p>



<h2>GPUs No Longer Improve</h2>



<p>One of the most common misconceptions I see is that people assume hardware keeps improving and improving. This is an important misconception that explains a lot of the poor thinking around AI progress. The efficiency of GPUs has driven almost all innovation in AI. AlexNet was only possible by developing one of the first CUDA implementations that could compute convolutions over networked GPUs. Further innovation was mostly possible through improved GPUs and using more GPUs. Almost everybody sees this pattern — GPUs improve, AI performance improves — and it is easy to think that GPUs will improve further and will continue to improve AI outcomes. Every generation of GPUs has been better, and it would seem foolish to think that it will stop. But actually, it is foolish to think that GPUs will continue to improve. In fact, GPUs will no longer improve meaningfully. We have essentially seen the last generation of significant GPU improvements. GPUs maxed out in performance per cost around 2018 &#8212; after that, we added one-off features that exhaust quickly.</p>



<p>The first of these one-off features was 16-bit precision, then Tensor Cores, or the equivalent, then high-bandwidth memory (HBM), then the TMA or equivalent, then 8-bit precision, then 4-bit precision. And now we are at the end, both in the physical and the idea space. I have shown in my paper about k-bit inference scaling laws what data types with particular block sizes and computational arrangements are optimal. This has already been adopted by hardware manufacturers. Any further improvement will lead not to straightforward improvements but to trade-offs: either better memory footprint at lower computational efficiency or higher computational throughput at higher memory footprint. Even if you can innovate – linear improvements need exponential resources – further improvements will be trivial and will not add any meaningful advancement.</p>
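To make the block-scaled low-bit formats mentioned above concrete, here is a minimal sketch of absmax blockwise quantization in plain Python. The 8-bit width and 64-value block size are illustrative choices for this sketch, not the optimal settings from the k-bit inference scaling laws paper:

```python
import random

def quantize_block(xs, bits=8):
    """Absmax-quantize one block: one float scale per block, integer codes per value."""
    qmax = 2 ** (bits - 1) - 1                     # 127 for 8-bit
    scale = max(abs(v) for v in xs) / qmax or 1.0  # guard against an all-zero block
    codes = [round(v / scale) for v in xs]         # codes lie in [-qmax, qmax]
    return codes, scale

def dequantize_block(codes, scale):
    return [c * scale for c in codes]

random.seed(0)
x = [random.gauss(0, 1) for _ in range(64)]        # one 64-value block
codes, scale = quantize_block(x)
x_hat = dequantize_block(codes, scale)
err = max(abs(a - b) for a, b in zip(x, x_hat))
print(f"max abs error in one block: {err:.4f}")    # bounded by scale / 2
```

The design point is the one the paragraph makes: smaller blocks mean tighter scales and lower error but more scale overhead per value, so past a certain point any change is a trade-off between memory footprint and accuracy rather than a free improvement.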



<p>While GPUs can no longer improve meaningfully, rack-level optimizations are still critically important. Efficient shuttling of key-value caches is one of the most important problems in AI infrastructure. The current solution to this problem, however, is also relatively straightforward. Companies like OpenAI boast about their AI infrastructure, but it is relatively simple to design because there is essentially only one optimal way to design it. And while it is complex to implement, it just needs clear thinking and mostly hard, time-intensive engineering. But the overall system design is not particularly novel. OpenAI – or other frontier labs – have no fundamental advantage in their inference and infrastructure stacks. The only way to gain an advantage is by having slightly better rack-level hardware optimizations or data-center-level hardware optimizations. But these will also run out quickly – maybe 2026, maybe 2027.</p>



<h2>Why Scaling Is Not Enough</h2>



<p>In my Twitter thread, I talked about how Gemini might signal a plateau in AI progress in the sense that we might not see meaningful improvements anymore. A lot of people responded with something along the lines of, &#8220;You are being too pessimistic. Can you not see that scaling works?&#8221; The point here is a bit more subtle, so I want to elaborate.&nbsp;</p>



<p>I believe in scaling laws and I believe scaling will improve performance, and models like Gemini are clearly good models. The problem with scaling is this: for linear improvements, we previously had exponential growth in GPU capability, which canceled out the exponential resource requirements of scaling. This is no longer true. In other words, previously we invested roughly linear costs to get linear payoff, but now the costs have turned exponential. That would not be a problem on its own, but it sets a clear physical limit on scaling that is rapidly approaching. We have maybe one, maybe two more years of scaling left because further improvements become physically infeasible. The scaling improvements in 2025 were not impressive. Scaling in 2026 and 2027 had better work out.</p>
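The arithmetic behind "linear improvements need exponential resources" can be sketched with a toy power-law loss curve. The constants below are invented purely for illustration and are not fitted to any real model family:

```python
# Illustrative power-law loss curve, loss(C) = a * C**(-b) + L_inf, in the
# spirit of compute scaling laws. All constants are made up for this sketch.
a, b, L_inf = 10.0, 0.05, 1.7

def loss(compute):
    return a * compute ** (-b) + L_inf

def compute_for_loss(target):
    # Invert the power law: C = (a / (target - L_inf)) ** (1 / b)
    return (a / (target - L_inf)) ** (1.0 / b)

# Each equal step down in loss costs a growing multiple of compute:
prev = None
for target in (2.3, 2.2, 2.1, 2.0):
    c = compute_for_loss(target)
    note = "" if prev is None else f"  ({c / prev:.0f}x more than the last step)"
    print(f"loss {target}: compute ~ {c:.2e}{note}")
    prev = c
```

Under these toy constants, each fixed 0.1 reduction in loss costs tens to hundreds of times more compute than the previous one. While GPU performance per dollar also grew exponentially, those multiples were absorbed; once the hardware curve flattens, they land directly on the bill.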



<p>Despite these exponential costs, the current infrastructure build-out is reasonable, particularly with the growth of inference use, but it still creates a very precarious balance. The biggest problem is this: if scaling does not provide much larger improvements than research/software innovations, then hardware becomes a liability and not an asset.&nbsp;</p>



<p>Small players like MoonshotAI and Z.ai show that they do not need many resources to reach frontier performance (I personally prefer Kimi K2-thinking over Sonnet 4.5 for coding). If these companies innovate beyond scale, they might just create the best model. While they might still use existing infrastructure, they could just switch to Huawei Ascend chips for inference, which are more than fine for providing good inference performance.</p>



<p>Another big threat to scaled-up infrastructure is that, currently, large-model inference efficiency is strongly tied to a large user base due to network scaling. The problem is that an efficient deployment of a large model needs a certain number of GPUs to overlap computation with networking and KV-cache length partitioning. Such deployments are ultra-efficient but demand a large user base to unlock full utilization and, with that, cost-effectiveness. That is why open-weight models have not yet had the expected impact: the infrastructure cost of large deployments needs a large user base. However, this problem can be solved with software.</p>
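To see why large deployments demand a large user base, it helps to run the KV-cache numbers. The sketch below uses hypothetical shapes for a ~300-billion-parameter-class model; none of these values are any real model's published configuration:

```python
# Back-of-the-envelope KV-cache sizing for a hypothetical ~300B-class model.
# Every shape number below is an assumption for illustration only.
layers        = 60       # transformer blocks
kv_heads      = 8        # grouped-query attention KV heads
head_dim      = 128      # dimension per head
bytes_per_val = 2        # fp16/bf16 cache entries

# Factor of 2 covers both the K and the V tensors per layer.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val

context = 32_768         # tokens of context per user
per_user_gb = kv_bytes_per_token * context / 2**30

print(f"KV cache: {kv_bytes_per_token / 1024:.0f} KiB per token")
print(f"         ~{per_user_gb:.1f} GiB per {context}-token user")
```

Under these assumptions a single long-context user holds several GiB of cache, so one 80 GB accelerator fits only a handful of such contexts alongside the weights. Efficient serving therefore shards the cache across many GPUs and needs enough concurrent users to keep all of them busy, which is exactly the utilization problem described above.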



<p>While vLLM and SGLang currently try to optimize frontier-type deployments, they do not provide this efficiency at smaller scales. With the right inference stack beyond vLLM/SGLang, people would be able to deploy a ~300-billion-parameter model with the same efficiency as OpenAI or Anthropic deploys their frontier models. If smaller models become more capable &#8212; we see this with GLM 4.6 &#8212; or if AI applications become more specialized, the infrastructure advantage of frontier labs might vanish overnight. The software complexity evaporates, and open-source, open-weight deployments might be close to physically optimal, both in terms of computational efficiency and information processing efficiency. This is a large risk for frontier players.</p>



<p>Under slowing scaling, any of these three factors might degrade the value of AI infrastructure significantly and rapidly: (1) research/software innovations, (2) strong open-weight inference stacks, (3) shift to other hardware.</p>



<p>The current trends do not look good for frontier labs.&nbsp;</p>



<h2>Frontier AI Versus Economic Diffusion</h2>



<p>The US and China follow two different approaches to AI. The US follows the idea that there will be one winner who takes it all – the one that builds superintelligence wins. Even falling short of superintelligence or AGI, if you have the best model, almost all people will use your model and not the competition&#8217;s model. The idea is: develop the biggest, baddest model and people will come.</p>



<p>China&#8217;s philosophy is different. They believe model capabilities do not matter as much as application. What matters is how you use AI. The key indicator of progress is how much AI is integrated into everything and how useful it is. If one model is better than another, it does not automatically mean it will be used more widely. What is important is that the model is useful and yields productivity gains at a reasonable cost. If the current approach is more productive than the previous one, it will be adopted. But hyper-optimization for slightly better quality is not very effective. In most cases, settling on &#8220;good enough&#8221; yields the highest productivity gain.</p>



<p>I think it is easy to see that the US philosophy is short-sighted and very problematic &#8212; particularly if model capability slows. The Chinese philosophy is more long-term focused and pragmatic.&nbsp;</p>



<p>The key value of AI is that it is useful and increases productivity. That makes it beneficial. It is clear that, similarly to computers or the internet, AI will be used everywhere. The problem is that if AI were just used for coding and engineering, it would have a very limited impact. While a lot of economic activity is supported by digital programs, these also have diminishing returns, and producing more software will not improve outcomes significantly if existing software is already good enough (just look at the SaaS failure in China). This makes widespread economic integration absolutely vital for AI effectiveness.</p>



<p>So in order to provide real value, AI needs to be used in ways that provide new benefits, not just improvements to what already exists. This is a difficult problem, but the right answer is to integrate AI into everything to squeeze out non-linear improvements, see what works and what does not, then keep what is working. China is taking this approach by subsidizing applications that use AI to encourage adoption. The Chinese population is very receptive to innovation, which facilitates this process. It is nothing unusual in China to see an 80-year-old grandma use AI to help her with her daily life. The US, on the other hand, bets on ideas like AGI and superintelligence, which I believe are fundamentally flawed concepts that have little relevance to future AI progress. This becomes clear when you think carefully about what these terms actually mean in physical reality.</p>



<h2>AGI Will Never Happen, and Superintelligence Is a Fantasy</h2>



<p>There is this pattern I have noticed: when you ask people in the Bay Area when AGI will happen, they always say it is a few years in the future, and it will have a massive impact. Then, if you ask them what AGI actually is, they do not include any physical tasks in their definition, and they do not consider resource inputs.&nbsp;</p>



<p>True AGI, which can do all things human, would need to be able to do physical tasks – which comprise the largest economic sector. In short, AGI should include physical robots or machines that are able to do economically meaningful work in the physical world. While physical robots might be convenient for unloading your dishwasher, you will not see them replacing specialized systems in factories. Specialized robots in factories are too efficient, too precise. China demonstrates that dark factories — fully automated facilities — are already possible. Most robotics problems are solved problems in controlled environments. Most existing robotics problems that remain unsolved are also economically unviable. Stitching sleeves to a t-shirt is an unsolved robotics problem, but it is also not particularly economically meaningful in most contexts. Household robots will be interesting, but if it takes me two minutes to unload my dishwasher, I am not sure I need a robot for that. And while in a couple of years a robot might be able to fold laundry, I would rather spend a few minutes folding it myself with no creases than have a robot do a mediocre job.</p>



<p>The main problem with robotics is that learning follows scaling laws very similar to those of language models, but data in the physical world is just too expensive to collect, and the physical world is too complex in its details. Robotics will have limited impact. Factories are already automated, and other tasks are not economically meaningful.</p>



<p>The concept of superintelligence is built on a flawed premise. The idea is that once you have an intelligence that is as good or better than humans — in other words, AGI — then that intelligence can improve itself, leading to a runaway effect. This idea comes from Oxford-based philosophers who brought these concepts to the Bay Area. It is a deeply flawed idea that is harmful for the field. The main flaw is that this idea treats intelligence as purely abstract and not grounded in physical reality. To improve any system, you need resources. And even if a superintelligence uses these resources more effectively than humans to improve itself, it is still bound by the scaling of improvements I mentioned before — linear improvements need exponential resources. Diminishing returns can be avoided by switching to more independent problems – like adding one-off features to GPUs – but these quickly hit their own diminishing returns. So, superintelligence can be thought of as filling gaps in capability, not extending the frontier. Filling gaps can be useful, but it does not lead to runaway effects &#8212; it leads to incremental improvements.</p>



<p>Furthermore, the same people who think that GPUs will infinitely improve are often the people who think superintelligence will make those improvements faster and better. But they do not realize that GPUs can no longer be meaningfully improved. We can wait for better HBM memory technology for speed, and for chiplets and advanced packaging to improve yield/cost, but that is it. Rack-level optimization will likely hit the physical wall in 2026 or 2027. A superintelligence will not accelerate the progress made in HBM development, manufacturing, testing, and integration. The transformer architecture is close to physically optimal. Superintelligence will not be able to meaningfully improve neural network architectures. Efficient large-scale deployments for inference are largely a solved engineering problem. It just needs some careful engineering and time, but very little creativity is required to solve this problem close to physical optimality. Superintelligence will not be able to improve our inference stack by much.</p>



<p>A superintelligence might help with economic diffusion of AI technology, but in the end, the limiting factor of economic diffusion is implementation and adoption, not capability. It is clear to me that any organization that strives primarily for superintelligence as a goal will encounter significant challenges and will ultimately falter and be displaced by players that provide general economic diffusion.&nbsp;</p>



<p>In summary, AGI, as commonly conceived, will not happen because it ignores the physical constraints of computation, the exponential costs of linear progress, and the fundamental limits we are already encountering. Superintelligence is a fantasy because it assumes that intelligence can recursively self-improve without bound, ignoring the physical and economic realities that constrain all systems. These ideas persist not because they are well-founded, but because they serve as compelling narratives in an echo chamber that rewards belief over rigor.&nbsp;</p>



<p>The future of AI will be shaped by economic diffusion, practical applications, and incremental improvements within physical constraints — not by mythical superintelligence or the sudden emergence of AGI. The sooner we accept this reality, the better we can focus on building AI systems that actually improve human productivity and well-being.</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2025/12/10/why-agi-will-not-happen/">Why AGI Will Not Happen</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://timdettmers.com/2025/12/10/why-agi-will-not-happen/feed/</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1233</post-id>	</item>
	</channel>
</rss>
