After reading How to Build an Agent (or: The Emperor Has No Clothes), I thought that building a coding agent was surprisingly straightforward. The article describes it as mostly just a matter of function calling and describing those functions clearly in the prompt. But I wanted to go deeper: what do state-of-the-art coding agents look like in practice? Multi-SWE-bench, a recent paper from ByteDance, evaluates three modern approaches: Agentless, SWE-agent, and OpenHands. In this post, I will walk through each of these frameworks. Let’s see if the emperor manages to get dressed.
By the way, if you're looking for definitions, this is probably the best definition of an agent: "Agents are models using tools in a loop".
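To make that definition concrete, here is a minimal sketch of a tool loop in Python; the `call_model` and `tools` arguments are placeholders I made up, not part of any particular framework:

```python
def run_agent(task, call_model, tools, max_turns=20):
    """Minimal 'model using tools in a loop' sketch (illustrative names only).

    call_model(history) -> dict with either an 'answer' or a tool call;
    tools: mapping from tool name to a plain Python callable.
    """
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        step = call_model(history)            # the model decides what to do next
        if "answer" in step:                  # done: the model returns a final answer
            return step["answer"]
        observation = tools[step["tool"]](**step["args"])  # execute the requested tool
        history.append({"role": "assistant", "content": str(step)})
        history.append({"role": "tool", "content": str(observation)})
    return None                               # budget exhausted without an answer
```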
Multi-SWE-bench is a highly readable paper with many interesting insights about the performance of language models when it comes to resolving GitHub issues. The paper shows that SOTA methods that do well on the standard SWE-bench (which contains just Python problems) perform much worse at resolving issues in other languages. Based on an extensive evaluation, it shows how the resolved rate falls steeply from easy (< 15 min human fix) to hard (> 1 h) issues; on hard tasks, most LLM/agent pairs approach 0% success. These results differ somewhat from what e.g. METR published; they indicate that the intelligence of LLMs is still very jagged.
Agentless: Demystifying LLM-based Software Engineering Agents, published in July 2024 by the University of Illinois, proposes an 'agentless' approach to resolving dev issues. Their setup was adopted by OpenAI as the go-to coding framework for o1, because they were the best-performing open-source scaffolding framework at the time. As the name suggests, this framework shines because of its simplicity; there is no function calling at all. They give three reasons for this:
Note: the above looks true, although it also looks like something models can easily get better at with more tuning! Teaching models to reflect more often with specific instruction data and instructing them to check function definitions before calling them should already yield good improvements (addressing reasons 1 and 3). Improving the decision-making process given a large action space might be more difficult, particularly in situations where a standard approach will not work.
Based on the above, they propose an 'agentless' framework consisting of three steps: localisation, repair, and patch validation. The process is agentless as it avoids using an autonomous, multi-turn LLM “agent” that plans actions, executes tools, and reacts based on environment feedback. The tools (such as reading, editing, etc.) are hardcoded parts of the pipeline, and are not freely available to an agent.
Localisation is done in three steps: first, relevant files are localised, then relevant elements within these files are localised, and finally, edit locations are determined.
The input for the LLM to perform this step is a tree-like representation of a repository along with the issue description and the request to list the top N suspicious files that need further inspection or modification to resolve the issue. In addition to that, an embedding-based retrieval method is used. This works as follows: we first ask the LLM to produce a list of irrelevant folders that do not need to be further inspected. Then, we divide the remaining files into chunks of code segments and compute their embedding using an embedding model. We then embed the original issue description and compute the cosine similarity between the issue embedding and each chunk to retrieve a list of relevant files. We then combine the results of both the prompting and embedding-based method, resulting in a final list of relevant files.
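As a rough sketch of the embedding-based half, this is how the retrieval could work; the `embed` function and the chunking details are stand-ins, not the paper's actual implementation:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_files(issue: str, chunks: list[tuple[str, str]], embed, top_k: int = 10):
    """Rank files by how similar their code chunks are to the issue description.

    chunks: (file_path, code_segment) pairs from the non-excluded files.
    embed: any text -> vector function (assumed, not the paper's exact model).
    """
    issue_vec = embed(issue)
    scored = [(cosine(issue_vec, embed(code)), path) for path, code in chunks]
    scored.sort(reverse=True)
    # a file is considered relevant if any of its chunks scores highly
    seen, files = set(), []
    for _, path in scored:
        if path not in seen:
            seen.add(path)
            files.append(path)
        if len(files) == top_k:
            break
    return files
```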
Now that we have a list of relevant files, we want to find the relevant elements within these files. Since context length is limited, we cannot simply add all files to the context. Instead, we add a compressed skeleton of each file, containing its classes, functions and variable declarations. With this context, we ask the LLM to provide a list of related classes and functions.
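A quick illustration of what such a skeleton could look like, using Python's `ast` module; the exact compressed format used by Agentless may differ:

```python
import ast

def file_skeleton(source: str) -> str:
    """Reduce a Python file to its classes, functions and top-level assignments."""
    tree = ast.parse(source)
    lines = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            lines.append(f"def {node.name}(...)")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}:")
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    lines.append(f"    def {item.name}(...)")
        elif isinstance(node, (ast.Assign, ast.AnnAssign)):
            lines.append(ast.unparse(node))  # keep variable declarations verbatim
    return "\n".join(lines)
```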
Now that we have relevant elements, potentially from different files, we ask the LLM to provide the final set of edit locations, specified by line numbers, functions or classes. The set of relevant elements we already obtained is small, so we can simply add it to our context and prompt the model. Straightforward enough.
Now that we have our edit locations, we want to produce the correct code to solve the issue at hand. To give the LLM the relevant information, we add the code surrounding the edit locations to our context, and then ask it to generate patches to solve the issue. Specifically, the LLM is asked to generate search/replace edits in a diff-like format.
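The exact prompt is not reproduced here, but a search/replace edit looks roughly like this (illustrative example, not taken from the paper):

```
### src/example.py
<<<<<<< SEARCH
def add(a, b):
    return a - b
=======
def add(a, b):
    return a + b
>>>>>>> REPLACE
```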
The diff format avoids generating the complete code and focuses instead on small edits. This is more cost-efficient and more reliable, as there is less room for hallucination. Agentless asks the LLM to produce multiple (40) candidate patches per issue.
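Applying such an edit can be as simple as the following sketch (my own illustration; the paper's implementation will handle more edge cases):

```python
def apply_search_replace(source: str, search: str, replace: str) -> str | None:
    """Apply one search/replace edit to a file's contents.

    Returns None when the search block does not match exactly once,
    which is a typical way for a model-generated edit to fail.
    """
    if source.count(search) != 1:
        return None
    return source.replace(search, replace)
```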
In this step, we finally execute some code. Because we have multiple patch candidates, we want to validate which one works. We cannot rely on bug-triggering tests -- these are not available in a realistic scenario where we just have a user reporting a bug without concrete tests to reproduce it. Fortunately, we can usually generate these tests! Again, we generate multiple (40) tests, normalise them (remove spaces and comments, normalise test names), and keep the most frequently occurring ones.
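A minimal sketch of this normalise-and-vote step; the normalisation here is deliberately crude compared to whatever Agentless actually does:

```python
import re
from collections import Counter

def normalise(test_code: str) -> str:
    """Crude normalisation: drop comments and blank lines, collapse whitespace,
    and rename the test function so superficial differences don't matter."""
    code = re.sub(r"#.*", "", test_code)
    code = re.sub(r"def test_\w+", "def test_reproduce", code)
    lines = [" ".join(line.split()) for line in code.splitlines() if line.strip()]
    return "\n".join(lines)

def most_common_tests(candidate_tests: list[str], keep: int = 1) -> list[str]:
    """Keep the generated tests that occur most often after normalisation."""
    counts = Counter(normalise(t) for t in candidate_tests)
    return [test for test, _ in counts.most_common(keep)]
```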
Now, before we run the newly generated tests, we want to make sure that other tests keep passing. So we run all other tests, but ask the LLM to exclude the tests that should not be run if the issue has been correctly fixed (as they might fail once the behaviour related to the issue changes). To me, this initially seemed somewhat counterintuitive, as excluding tests related to the issue might remove the ability to test current, desired behaviour around the issue. I guess the idea is that because we generated tests for the issue, we should not rely on older tests that were not able to catch whatever was mentioned in the issue report; these tests could in fact enforce the wrong behaviour.
Now, to select the final patch, we run all regression tests and keep the patches with the lowest number of regression failures. We then run the newly created tests and again only keep the patches that pass. If no patch passes these new tests, we fall back to the regression tests alone to select the patch. If multiple patches pass all tests, we normalise the code (remove comments/spaces/newlines) and keep the patch that occurs most often among the normalised patches (after normalisation, some candidates turn out to be identical).
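Putting the selection logic together, a hedged sketch could look like this, where `regression_failures` and `passes_new_tests` stand in for actually running the test suites:

```python
from collections import Counter

def _normalise(patch: str) -> str:
    """Crude normalisation so that near-identical patches count as one."""
    return "\n".join(" ".join(l.split()) for l in patch.splitlines() if l.strip())

def select_patch(patches, regression_failures, passes_new_tests):
    """patches: list of candidate patch strings.
    regression_failures: patch -> number of failing regression tests.
    passes_new_tests: patch -> True if the patch passes the generated tests."""
    # 1. keep the patches with the fewest regression test failures
    best = min(regression_failures(p) for p in patches)
    candidates = [p for p in patches if regression_failures(p) == best]
    # 2. if any candidate also passes the newly generated tests, prefer those
    passing = [p for p in candidates if passes_new_tests(p)]
    if passing:
        candidates = passing
    # 3. majority vote over normalised patches
    counts = Counter(_normalise(p) for p in candidates)
    winner, _ = counts.most_common(1)[0]
    return next(p for p in candidates if _normalise(p) == winner)
```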
And that's it! It's a relatively simple approach, relying on many generated candidate tests and patches, but it seems to work well: this setup obtains 32% on SWE-bench Lite with GPT-4o (gpt-4o-2024-05-13), which was the state of the art for open-source tools in 2024.
SWE-Agent, published in May 2024 by Princeton University, does use a more agentic framework. Their approach with GPT-4 Turbo leads to a 12.4% accuracy on SWE-bench.
Note: this is not the SWE-bench Lite version evaluated by Agentless. Although the Lite version is reportedly designed with a 'similar difficulty spectrum', it focuses on more self-contained bugs; shouldn't that make it easier than the full dataset?
SWE-Agent uses an agent-computer interface (ACI): a set of LM-friendly commands through which the agent can perform actions such as viewing, searching or editing files. To design the right ACI, they manually inspect agent behaviour and run a grid search to select the best ACI configuration. Given this analysis, they come up with the following guidelines:
Now, what can the agent do? Let's describe the available commands. At each turn, the agent generates a thought and a command, and then incorporates the feedback from the command's execution in the environment. The following actions are possible:
The agent can open files and move through them with the `scroll_down` and `scroll_up` commands, or jump to a specific line with the `goto` command. To guide the agent, these commands display the full path of the open file, the number of lines in the file, the number of lines omitted before and after the current window, and the file line numbers. In addition to these specific commands, the agent also has access to common Linux commands.
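As an illustration of the idea (not SWE-agent's actual output format), a file window could be rendered like this:

```python
def render_window(path: str, lines: list[str], start: int, size: int = 100) -> str:
    """Show a window of a file plus the context the agent needs:
    full path, total length, and how many lines are hidden on each side."""
    end = min(start + size, len(lines))
    header = f"[File: {path} ({len(lines)} lines total)]"
    above = f"({start} lines above not shown)" if start > 0 else ""
    below = f"({len(lines) - end} lines below not shown)" if end < len(lines) else ""
    body = [f"{i + 1}: {lines[i]}" for i in range(start, end)]
    return "\n".join(part for part in [header, above, *body, below] if part)
```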
File editing is difficult. The file editing scaffolding is critical -- without it (but with e.g. `sed` for editing files), performance drops by 7.7 percentage points. Without the linting feedback, and the process of discarding edits in case of linting errors, performance drops by 3 percentage points. Regardless, editing remains challenging; 51.7% of the executed SWE-bench tasks had at least one failed edit. As such, any attempt at editing has a 90.5% chance of eventually being successful, which drops to 57.2% after a single failed edit.
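The guarded-edit idea can be sketched as follows; I'm using `py_compile` as a stand-in for SWE-agent's actual linter:

```python
import os
import subprocess
import sys
import tempfile

def guarded_edit(path: str, new_content: str):
    """Apply an edit only if the edited file still passes a syntax check;
    otherwise keep the original file and return the errors as feedback."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write(new_content)
        tmp_path = tmp.name
    try:
        result = subprocess.run(
            [sys.executable, "-m", "py_compile", tmp_path],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            return False, result.stderr   # discard the edit, feed errors back
        with open(path, "w") as f:
            f.write(new_content)
        return True, "edit applied"
    finally:
        os.unlink(tmp_path)
```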
Most failures are incorrect implementations. See the pie chart below; these categorisations were made by GPT-4.
Agents succeed quickly and fail slowly. Successful runs are completed earlier and at lower cost than unsuccessful ones. 93% of the resolved instances are submitted before exhausting their cost budget. This suggests that increasing token or budget limits will likely not substantially improve performance. It looks like longer runs just increase the chance of making errors, resulting in a lower resolved rate.
OpenHands: An Open Platform for AI Software Developers as Generalist Agents, first published in July 2024, describes an open-source project that consists of a range of coding agent features such as multi-agent delegation, agent evaluation and more general interaction mechanisms. OpenHands focuses on creating a more general platform for 'digital agents' rather than optimising a setup for a specific task. As a result, the OpenHands paper is mostly a list of available features (e.g. access to a Chromium browser, a Jupyter IPython environment, Docker image support, etc.) rather than a specification of how to best run an agent. Since listing features is not the purpose of this blog post, I'll refer you to the paper if you want to know more. As OpenHands is more general, it might be less optimised for Python or SWE-bench, and therefore perform slightly better on other coding languages, as can be seen in the Multi-SWE-bench paper:
Reading the papers, I was reminded of this section from the Dwarkesh podcast with Trenton Bricken and Sholto Douglas, where Sholto answers what limits model performance right now:
I think what we're seeing now is closer to: lack of context, lack of ability to do complex, very multi-file changes… sort of the scope of the task, in some respects. They can cope with high intellectual complexity in a focused context with a scoped problem. When something's a bit more amorphous or requires a lot of discovery and iteration with the environment, with this kind of stuff they struggle more. Maybe the way I would define the thing that's holding them back like this. If you can give it a good feedback loop for the thing that you want it to do, then it's pretty good at it. If you can't, then they struggle a bit.
Something similar can be seen in the papers summarised above: it is in scoping the task (deciding on the right action) and in the many interactions with the environment that models currently fail. Hence, all frameworks:
My current, maybe naive take is that scaling reinforcement learning on specific tasks will go a long way towards resolving these issues. Even though Multi-SWE-bench shows limited progress in languages other than Python, the progress on SWE-bench is truly astounding - and this is, as far as I can tell, mostly because of scaling RL: allowing the model to discover how to best take actions given an action space and environment. From this perspective, RL offers so much more than supervised training; it offers a way to discover the right actions, which seems to be a crucial component for fixing current limitations. The environment of different coding languages varies in terms of tooling, runtime errors and memory management, but training with RL to deal with these variations should help. I suspect that as the community progresses on scaling RL, more and more of the structure included in current agent frameworks can be removed - the model will learn what to do when. For better generalisation, the key to progress may lie not in further refining heuristics, but in letting models learn through interaction what those heuristics should be.
If you liked this blog post, feel free to share it or follow me on LinkedIn or Twitter.