If you look at how people cook coding agents today, they have an industrial kitchen: large-scale reinforcement learning systems with many components for efficiency spanning hundreds of GPUs, complex repository mixing, and large teams working on all angles to optimize the data pipeline for training. For the family of Open Coding Agents we released today from Ai2, we had the equivalent of a hot plate and a frying pan: 32 GPUs and five bright-eyed researchers who wanted to cook state-of-the-art coding agents.
This blog post is about how we cooked that coding agent.
Along the way, I will share the struggles and technical details that usually do not make it into papers: switching fields, disappointing people, and feeling irrelevant. And then the failures we hit when we actually tried to cook an agent with our frying pan, through to the eventual breakthrough.
Once we had our first breakthrough, we could upgrade our frying pan to a full set of pots and an induction oven (96 GPUs). The result? A method that lets you cheaply, with just a couple of GPU days, finetune a 32B model on your own private codebase to create a coding agent that rivals or exceeds the teacher model’s performance on this private data. And our software makes it easy to deploy it in the Claude Code environment for a smooth experience.
What This Post Covers
This post has four parts:
1. The Struggle: Switching fields, the costs of starting over, and what it felt like to be in the middle of it.
2. Getting a Foothold: My early work on coding agents, taking an 8 billion parameter model from 0% to 24% on SWE-bench.
3. Data Generation Methods: Three approaches we tried: subtask splitting, end-to-end training, and finally SERA. If you want to skip to what actually worked, jump straight to the [SERA section](#sera-soft-verified-efficient-repository-agents). If you want to understand why we ended up there, read on.
4. What This Means For You: What private repo specialization can do for you, how to deploy these models, the Claude Code integration, and what becomes possible when coding agent research is cheap.
—
The Struggle
Switching Fields is Scary
I started in a good position. I had become an expert in quantization, developing methods such as QLoRA and k-bit inference scaling laws. These were hugely successful, particularly when I integrated them into the bitsandbytes library, which reached millions of monthly downloads. QLoRA became a staple in the industry. And the findings of the k-bit inference scaling laws would eventually be implemented at the hardware level in Blackwell GPUs.
But research means extending knowledge, not just refining what you already know. While I was on the job market, I already knew two things that forced me to change direction: (1) quantization and other efficiency research was hitting diminishing returns and becoming less exciting; (2) it was clear to me that chatbot models would not yield the productivity improvements we need. So in August 2024, I made the leap, abandoned all previous work, and started working on coding agents, which I thought was the most promising direction.
It was scary.
Switching fields from one you are well established in to something completely new is probably one of the hardest things you can do in research. People were excited that I was joining the Allen Institute for AI. I brought expertise with me. But after I joined, and people realized I was going to work on coding agents rather than something related to my previous expertise, they were surprised. Are you going to help me with efficient training and inference? Nope, I am going to learn about coding agents.
That felt selfish. I could really help my colleagues, but I chose not to, so I could focus on transitioning to coding agents and be helpful there instead.
That was particularly painful because I really like to help people – my main research goal remains to make the best AI available and accessible to anyone. I disappointed many people whom I liked and respected. And the costs accumulated. I had not published anything in about a year. Students in that PhD admissions cycle were not excited to work with me, as they saw that I had produced nothing interesting. I watched as people paid less attention to me, as I became irrelevant. That is a particular kind of loneliness in research: knowing you are working hard on something, but having nothing to show for it.
Getting a Foothold: Learning From Inspecting Trajectories
To get started, I hurled myself into the depths of coding agents to build the intuition and understanding that would guide my future research.
As a proponent of open source, I naturally worked with open-source models. When I looked at SWE-bench performance for an 8B model, it was basically 0%. SWE-bench is a benchmark where you receive a GitHub issue describing a bug in a codebase and need to fix it. Closed-source models at the time performed around 30%. The 8B model had 0%.
My first challenge: could I improve this 8B model as much as possible and make open source competitive?
Being resource-limited forces you to run fewer experiments, making each one more important. My first step was to learn how to interpret trajectory data. So I downloaded trajectories from frontier models and studied them. I examined two things: the workflow models used to solve bugs, and the problems they encountered. This analysis helped me identify key patterns and strategies that I could adapt to improve open-source models.
Every closed-source coding model learns the same workflow, one that closely mirrors what humans do. They read the GitHub issue and try to understand where the related functions and classes are in the codebase. This leads to “rabbitholing” (as Ethan puts it), where a model goes both broad and sometimes deep into the code to find the bug and understand it thoroughly by examining how functions relate to other components. This builds the understanding that the model needs for the next steps.
The next steps can be interchanged: write a bug fix, then a replication script, or vice versa. Technically, writing a replication script first to verify the bug is better, but closed-source models sometimes interchange these steps. Then comes an iteration stage where editing continues until the replication no longer throws an error. Finally, you create a patch and submit it.
The Copy Problem
Looking at what was going wrong with small open-weight models, I found the core problem: they are very bad at copying information. This is critical for tool calling. If you want to explore a codebase, you need to reference function names exactly. These tool-calling errors often confused small open-weight models, which would generate a reasoning trace like "This does not work. Let me try something else". These small imprecisions stopped small open-weight models from getting any solutions at all.
The fix was straightforward: do a fuzzy search against existing functions so tool calls are made against the right function (instead of non-existing ones). This immediately led to an improvement from 0% to 7% solve rate. From there, I could run more trajectories and see other failures. The model sometimes failed to produce the closing tag required for tools, leading to no tool call and confusion that resulted in errors. Ensuring tags are always closed for tool calls pushed performance to 14%.
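The idea is simple enough to sketch in a few lines. Below is a minimal illustration of the fuzzy-matching fix, using Python's standard difflib; the function names, cutoff value, and helper are illustrative, not the exact implementation from my scaffold.

```python
import difflib

def resolve_symbol(requested_name: str, known_symbols: list[str]) -> str | None:
    """Map a possibly misspelled function or class name from a tool call
    to the closest real symbol in the repository."""
    if requested_name in known_symbols:
        return requested_name  # exact match, nothing to fix
    # Fuzzy match against all known symbols; the cutoff here is illustrative.
    matches = difflib.get_close_matches(requested_name, known_symbols, n=1, cutoff=0.8)
    return matches[0] if matches else None

# Example: the model asks for "get_confg_value" but the repo defines "get_config_value".
symbols = ["get_config_value", "load_settings", "ConfigParser"]
print(resolve_symbol("get_confg_value", symbols))  # -> "get_config_value"
```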
The big lesson: open-weight models were not good at copying information, while closed-source models excelled at it. Armed with this knowledge, I made further changes to make smaller models more robust when referring to symbolic information.
I pushed the 8 billion parameter model to 24%, very close to GPT-4o at 25% for the scaffolds at that time. These quick improvements were mostly driven by my experimental methodology. I carefully estimated the variance in results so I could say, with precision, what works and what is noise. This was something I learned from writing the QLoRA paper, where I ran thousands of small experiments to verify which variables mattered at a larger scale.
This was exciting. I felt I was getting a foothold in coding agents. By November 2024, people at AI2 were interested in applying these methods to their agent workflows. Nobody at AI2 was working on coding agents specifically, but people were working on scientific agents, so I shifted to building a general agent framework. Then came the health problems, and the pause.
The Setback
Just as I was starting to make progress, misfortune struck. I developed health problems that forced me to take a break. While that was a difficult decision, I stopped working in February 2025.
I had committed to hiring an intern, Ethan Shen, who was highly recommended with an impressive background. Even though I could not work and program myself, I wanted to mentor him. Ethan started one day before Claude Code was released. He had not worked on coding agents before.
But Ethan learned quickly. Remarkably quickly. Together with my mentoring, he made rapid progress. A couple of months into the coding agents project, he would reach out to other researchers about coding agents, and it turned out he often had more insights into what determines good performance than they did. He became one of the most knowledgeable people in the coding agent space. I like to think I had a role in Ethan’s growth, but really, Ethan is just exceptionally good.
When Ethan and I achieved our first state-of-the-art results with the subtask-splitting method, it proved that all the struggles and setbacks were worth it. I hope this kind of progress can inspire you to keep pushing through your own challenges in research.
Data Generation Methods
When I fully returned to work after my health-related break, the landscape had shifted. Qwen 3 had been released with improved reasoning capabilities. Open-source models now had the precision to make mostly correct tool calls. The bottleneck was no longer scaffolding. It was data.
Working with real training data from real GitHub issues was not scalable. Wrangling real data is difficult, particularly when you have only a single person, new to coding agents, working on it. So the choice was obvious: the only scalable solution was generating synthetic data.
At the time, most people wanted to generate correct training data: find GitHub issues with bugs, which gives you the final solution (the patch), the faulty state (the bug), and the repository. Then generate synthetic training data from a larger model’s rollout, which you use to fine-tune a smaller model.
For efficient synthetic data, you need a different approach. The core problem is generating labels. A simple approach: start with the label, then mutate your data to generate the example. This means: start with a correct codebase, generate a synthetic bug, then use the bug as the input and the correct codebase as the target. This makes scaling easy.
But here is where the resource constraint bit hard. Coding agents usually need massive amounts of compute, and we only had 32 GPUs at that time. This is the same problem I faced with QLoRA: you need resources to learn to be resourceful, but you do not have them yet. Eventually, it becomes a virtuous cycle, a flywheel. But first, you need to get it spinning by reaching a level of efficiency that makes your resources sufficient for further progress.
We tried three approaches. Each one taught us something that led to the next.
Approach 1: Subtask Splitting
Code search is one of the most important problems in coding agents. If you cannot find the bug, you cannot solve it. If you find it, you have a good chance. So the most important thing to learn from a trajectory is searching the codebase. Why not learn that first, then learn editing afterwards?
This has an advantage: subtask methods might transfer across problems. Searching a codebase is useful for many digital problems. Editing is used to add features, fix bugs, and more. By splitting into subtasks, we hoped to gain efficiency that spreads to many subtasks. Doing well on search first would make learning to edit more efficient, since we could already find bugs with higher precision.
The problem: you have no labels for subtasks. But you can use a mutation method. Start with a correct codebase, mutate it into a problem state, then use the original state as the label and the mutated state as input. This generates countless tasks for learning to search. We refer to this problem as the proxy task problem: how to find a task that indirectly helps you learn what you care about.
Finding Good Proxy Tasks
How do you find which task is good for generating data? There is a simple, effective method. Take multiple models, generate a proxy task via mutation, evaluate them on both the proxy task and SWE-bench, and then compute the correlation.
A relevant task shows a correlation close to 1: better proxy task performance means better SWE-bench performance. A bad proxy task shows low correlation.
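In code, the selection procedure is just a correlation over per-model scores. Here is a minimal sketch, assuming you already have each model's solve rate on the candidate proxy task and on SWE-bench; the numbers below are made up.

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Hypothetical solve rates for five models; the values are illustrative only.
proxy_task_scores = [0.22, 0.35, 0.48, 0.61, 0.70]  # candidate proxy task
swe_bench_scores  = [0.05, 0.11, 0.19, 0.27, 0.33]  # SWE-bench, same models

r = correlation(proxy_task_scores, swe_bench_scores)
print(f"Pearson r = {r:.2f}")  # close to 1.0 -> the proxy task tracks SWE-bench
```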
The best search proxy task, which dramatically improves performance on SWE-bench, is the following: take a function, have a language model summarize it, then create a task where the model gets only the summary and must search the codebase to find that function. This task has almost a one-to-one correlation with SWE-bench, showing that search is critical and that you can easily create tasks that mirror it.
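Here is a rough sketch of how such a task can be assembled for a Python repository. The `summarize` argument stands in for a call to whatever language model you use, and the prompt wording and helper names are my own illustration, not our exact pipeline.

```python
import ast
import pathlib

def extract_functions(repo_path: str):
    """Yield (location, source) for every top-level function in the repository."""
    for py_file in pathlib.Path(repo_path).rglob("*.py"):
        text = py_file.read_text()
        for node in ast.parse(text, filename=str(py_file)).body:
            if isinstance(node, ast.FunctionDef):
                yield f"{py_file}:{node.name}", ast.get_source_segment(text, node)

def make_search_task(location: str, source: str, summarize) -> dict:
    """Build one search example: the model sees only the summary and must
    locate the summarized function in the repository."""
    summary = summarize(
        "Summarize what this function does, without naming it:\n" + source
    )
    return {
        "prompt": "Find the function in this repository that does the following:\n" + summary,
        "target": location,  # supervision: the search should end at this symbol
    }
```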
Fine-tuning on just 500 samples from this task makes 8 billion models almost as good as frontier models at searching codebases. This heavily boosts open-weight model performance and easily reaches state-of-the-art results (at that time). This was our first state-of-the-art result.
We considered publishing here, but I have a tendency to push further. My thought is this: a bad paper has no impact, a good paper has some impact, a great paper has a massive impact. If you have a good result, why publish? Go for the great result instead. What if you get scooped while you try to get a great result? Not too bad. You quickly publish, and you still have a good result.
Ethan saw it the same way and we pushed on.
The Editing Step
With search solved, the second step was editing a function once you have found it.
Our method: take a correct codebase, insert a bug, describe it in a GitHub issue like SWE-bench, and provide the altered codebase with the bug plus a search trace from the first subtask. The model uses the search context pointing to the correct function and learns to edit it.
Around this time, SWE-Smith was released by a good friend, Ofir Press, using the same method. The difference: they used unit tests to verify that generated bugs actually break things. We did not have resources for that infrastructure.
Despite unverified bugs, we got good results. However, while search needs only 500 examples for near-frontier performance, editing is different. It requires precise updates, so we needed many samples with many different bugs.
The procedure was efficient, as we could append the search trace and reproduce the bug. But we realized that the editing model was as computationally expensive as an end-to-end approach that combined both steps. And when we ran an end-to-end experiment, we got results nearly as good as with our split-task generation, with the only difference being that end-to-end is much simpler. So we decided to switch to end-to-end generation and training.
Approach 2: End-to-End Generation
End-to-end generation is essentially something like SWE-Smith: (1) mutate a correct codebase to get a bug; (2) create a GitHub issue; (3) generate rollouts from a model; (4) train on the data (GitHub issue, mutated codebase, trajectory).
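Compressed into code, one training example looks roughly like the sketch below. The `teacher` object bundles the LLM calls for mutation, issue writing, and rollouts; every helper name here is illustrative rather than taken from our codebase.

```python
def generate_end_to_end_example(codebase, function, teacher):
    """One end-to-end training example: bug -> issue -> rollout -> training record."""
    buggy_codebase, bug_patch = teacher.mutate(codebase, function)  # (1) insert a synthetic bug
    issue = teacher.write_issue(bug_patch)                          # (2) describe it like a GitHub issue
    trajectory = teacher.rollout(issue, buggy_codebase)             # (3) let the teacher try to fix it
    return {                                                        # (4) what the student trains on
        "issue": issue,
        "codebase": buggy_codebase,
        "trajectory": trajectory,
    }
```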
Generating lots of data is expensive, so we settled on not verifying by running tests. We had basically only Ethan working on the project at this point, and full verification would not have been feasible. So instead of hard verification of our data, we settled on soft verification.
Soft verification means we trained on data that could be incorrect. We only compare the generated patch with the mutated patch, looking for partial line-by-line overlap. If the model generates a patch that overlaps with the target patch by at least 50%, we accept it as training data. This reduced costs significantly, since running tests for verification consumes time and CPU cycles.
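A minimal sketch of that check, assuming patches are unified diffs compared as sets of changed lines; the 50% threshold comes from the text above, the rest is my own illustration.

```python
def changed_lines(patch: str) -> set[str]:
    """Collect added/removed lines from a unified diff, ignoring file headers."""
    return {
        line for line in patch.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    }

def soft_verify(generated_patch: str, target_patch: str, threshold: float = 0.5) -> bool:
    """Accept a trajectory if its patch overlaps the target patch line-by-line
    by at least `threshold`. No tests are run."""
    target = changed_lines(target_patch)
    if not target:
        return False
    overlap = len(changed_lines(generated_patch) & target) / len(target)
    return overlap >= threshold
```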
We also needed a cheap model for data generation. GLM 4.5-Air and 4.6 give strong SWE-bench results and are open-weight, but the full GLM 4.6 is too big for the 32 GPUs that we needed for generation, training, and evaluation. The GLM 4.5-Air model hit the sweet spot: strong performance at low cost on a few GPUs. This gave us low verification complexity and efficient generation.
Cost-effectiveness spins up a flywheel: more efficient resource use leads to faster experimentation, which leads to faster progress.
Since we settled on no hard verification, our method is feasible with the 32 GPUs we had at that time.
But doing this revealed something unexpected.
Approach 3: Soft-verified Generation – Embracing Vagueness
When generating bugs, you need to give the model a hint of what the bug is without giving it away. This is not easy. In practice, you often have a function as a starting point and a vague description of the bug. What we did initially was analyze the call graph of a codebase, then find functions that build a chain func A -> func B -> func C. Then we would prompt the model with "There is a bug downstream from A". With this vague description, the model often produces all kinds of different "bug fixes". A bug stemming from function reuse might lead to a solution that looks more like refactoring than a bug fix. We got refactoring traces even when asking for bug fixes.
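Here is a rough sketch of how such downstream hints can be built, assuming you already have the call graph as an adjacency mapping; the graph, names, and prompt wording are illustrative.

```python
def call_chains(call_graph: dict[str, list[str]], length: int = 3):
    """Yield chains func_a -> func_b -> func_c from an adjacency mapping."""
    def walk(chain):
        if len(chain) == length:
            yield tuple(chain)
            return
        for callee in call_graph.get(chain[-1], []):
            if callee not in chain:  # avoid cycles
                yield from walk(chain + [callee])
    for start in call_graph:
        yield from walk([start])

# The hint names only the top of the chain, so the bug location stays vague.
graph = {"parse_request": ["validate_token"], "validate_token": ["decode_claims"]}
for top, *_ in call_chains(graph):
    print(f"There is a bug downstream from `{top}`. Find and fix it.")
```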
At first, this seemed like a problem. But at this point, Saurabh Shah joined the project. He built reinforcement learning (RL) infrastructure and trained a lot with RL. He told us that this is actually pretty common for RL too: training on incorrect data often gives good performance.
So we doubled down: instead of trying to fix this incorrect data, we embraced it, and that led to something unintuitive.
Instead of trying to generate training code that is as correct as possible, we do something simpler – and something slightly wacky. Take a correct codebase, select a random function, and tell the model there is a “bug” downstream from that function (even if there is no downstream function).
To prompt a model with a correct codebase and “There is a bug somewhere here. Please find it and fix it” is ridiculous – there is no bug! But when you realize that most of the generations are never about bugs anyway, this is more sensible. And the other side of this is that it is very cheap, and very easy to do. It is just prompt + function.
The model explores from that function, looks at the code, and imagines a bug that does not actually exist. It might uncover a missing edge case, a missing assertion, inaccurate documentation, poor code that needs to be refactored, or a “real bug”. As such, while instructed to fix a bug, the model creates many other sensible improvements to the already correct code.
To keep generation cheap, we continue the pipeline as usual in the second rollout: generate a GitHub issue from that first rollout and its patch, then do another rollout starting from the GitHub issue and the correct codebase to finally create a second patch. This leads us to two trajectories: one with a GitHub issue and another without.
This generates lots of training data from just random functions in a codebase. We compare the overlap between the two patches with soft verification. Two trajectories in succession make everything compute-efficient and simple. All you need is a codebase with functions and a model that can generate trajectories through two calls.
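Putting the pieces together, the generation loop is roughly the sketch below, reusing the soft-verification check from earlier. The `agent` object stands in for the teacher model calls, and all helper names are illustrative.

```python
def sera_example(codebase, function_name, agent, soft_verify):
    """Two rollouts over the *correct* codebase, anchored at a random function."""
    hint = f"There is a bug downstream from `{function_name}`. Find and fix it."

    # Rollout 1: the model explores, imagines a "bug", and produces a patch.
    traj_1, patch_1 = agent.rollout(codebase, hint)

    # Turn that rollout into a GitHub-style issue, then roll out again from scratch.
    issue = agent.write_issue(traj_1, patch_1)
    traj_2, patch_2 = agent.rollout(codebase, issue)

    # Keep the pair only if the two patches agree enough (soft verification).
    if soft_verify(patch_2, patch_1):
        return [
            {"prompt": issue, "trajectory": traj_2},  # trajectory with a GitHub issue
            {"prompt": hint, "trajectory": traj_1},   # trajectory without one
        ]
    return []
```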
—
SERA: Soft Verified Efficient Repository Agents
This approach gives us a much higher data ceiling. Typical data generation pipelines are limited by how many sensible bugs you can create. Verifying that bugs actually cause tests to fail limits this further. You also need a starting point that does not reveal the bug location. This means many other synthetic data approaches can only generate a very limited amount of data from a single repository.
We have none of these limitations. From a codebase with a thousand functions, we can generate a thousand trajectories. To amplify further, we examined papers analyzing common engineering bugs, getting a distribution of 51 bug types.
For each function, we can generate 51 different bugs. A codebase with 1,000 functions yields 51,000 trajectories easily and cheaply. We are almost not data-limited – even for a single repository.
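A tiny sketch of that amplification, combining the extracted functions with the bug-type list; the bug-type strings below are placeholders for the real 51-entry distribution.

```python
from itertools import product

# Placeholder entries; the real list has 51 bug types drawn from the literature.
bug_types = ["off-by-one error", "missing null check", "wrong operator"]
functions = ["parse_request", "validate_token", "decode_claims"]  # from the repository

tasks = [
    f"There is a {bug} downstream from `{fn}`. Find and fix it."
    for fn, bug in product(functions, bug_types)
]
print(len(tasks))  # len(functions) * len(bug_types); 1,000 functions x 51 types -> 51,000
```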
This result leads to a core advantage of SERA: private codebase specialization.
Closed-source models do well on popular programming languages and common patterns. But on less-represented languages or proprietary codebases, they struggle. The holy grail is to specialize a model by training it on private data without exposing that data to a provider.
We generate massive amounts of data for a particular repository. Generating 7,000 trajectories takes just 19 GPU days. Training on these trajectories gives us a 32 billion model as good as the teacher, GLM 4.5-Air, the model from which we generated data. This works for other teacher models too, meaning you can compress frontier-level performance specialized to private data into a small, easily deployable model.
With enough data, we exceed the performance of the teacher model. This makes sense: we generate data from private repositories that the teacher has never seen. The student can exceed the teacher through specialized data.
This is a massive result. It helps companies quickly specialize a small model to frontier-level performance, or exceed it, on their private data.
Simple to Use Infrastructure Including Claude Code Integration
Ethan, Saurabh, and I knew we had something special, but we wanted to have a nice release and we were thin in terms of building that release. So Danny Tormoen joined us – and boy, did he do a good job!
Danny very quickly developed a converter that adapts the SWE-agent tool calls to be compatible with the Anthropic API and converts responses back to what SWE-agent understands. Danny also created a deployment script for Modal that you can run without a credit card. You just need to create an account, and you can try our agent in Claude Code for a couple of hours. All these tools are usable with one or two lines of code or commands. The same is true for our training and data generation pipeline.
With this, we could quickly start dogfooding our model in Claude Code. Initial impressions are “a pretty good model for its size,” but the model also has a habit of wanting to submit patches after some iterations – a clear artifact of the training procedure.
What This Means For You
With SERA, you can easily and cheaply train high-performing coding agents on your personal or proprietary repositories and deploy these agents seamlessly in a Claude Code interface. The Claude Code integration is seamless – you can deploy with a single line of code.
A 32 billion model will not give you frontier performance right now. But it gives you a strong, specialized model for your private code, and you do not need to send your data to any provider.
More importantly, by tinkering with this, you gain skills. Once more advanced methods become available, you will be ready to use them. The trajectory of this field suggests that matching or exceeding frontier-level performance on private data is within reach. And SERA is a very exciting result in this direction.
I am very excited to see what you will build and do with SERA — the first release of Ai2's Open Coding Agents family — and with specialized coding agents for your data.
Conclusion
Because our method is cheap, it opens coding agent research to everyone. You do not need large teams to maintain complex reinforcement learning setups. You do not need thousands of dollars to replicate a baseline.
Our baseline, which beats the previous state-of-the-art open-source method, costs $500 to run. Any lab with a few GPUs can now do coding agent research.
I would not be surprised if academic labs reproduce frontier-level coding agent performance for private coding data by the end of 2026. Open-weight models are already approaching that level, and I think we can shrink everything down to 100B, maybe even 32B, models that are super easy to deploy and use. With our method and future methods, you can specialize on your private data without exposing it.
There is a real promise here: open-source models that exceed frontier performance because you use your private data, train your own model, and deploy it yourself.
This is how we cooked our coding agent with a hot plate and a frying pan. The industrial kitchen is nice, but it is not required. Sometimes constraints force you to find simpler solutions – and sometimes those solutions turn out to be better.

