
Unified Software Engineering Benchmark (USE-Bench)

Welcome to USEbench, a meta-benchmark that unifies multiple software engineering benchmarks behind a single API. We currently support the following benchmarks:

| Benchmark | Task | Concept | Solving Criteria | # | Source |
|---|---|---|---|---|---|
| **Primitive Tasks** | | | | | |
| SWE-Verified | Program Repair | Adjust a program according to an issue. | Test-Suite Pass | 500 | SWE-Verified |
| SWT-bench-Lite | Regression Testing | Produce a test that asserts a required change, as presented in an issue. | Patch-Coverage | 298 | SWT |
| REPOCOD-Lite | Code Generation | Generate a method body matching the functionality outlined in documentation. | Test-Suite Pass | 200 | REPOCOD |
| REPOTEST-Lite | Test Generation | Test a given method to reach 100% coverage. | Code-Coverage | 173 | novel |
| **Compound Tasks** | | | | | |
| SWETRY | Test Generation, Program Repair | Enrich the SWE task with a previously failed but promising patch. | Test-Suite Pass | 100 | novel |
| REPOFEAT | Code & Test Generation | Add both tests and code; evaluate against the newly produced code coverage. | Code-Coverage | 46 | novel |

for a total of 1318 datapoints, each with a description and an executable (docker) environment. The benchmarks are intentionally chosen subsets of the larger bodies, mainly to keep computational demands and costs manageable. We currently do not support or host a leaderboard.

Scripts / Instructions

Installation + Setup

Requirements

  • Python >=3.12
  • Docker (we used 27.4)
  • 200 GB of disk space recommended for a full gold run

Installation

pip install -e .

Creating Dataset

cd usebench/migration
python migration.py

This will create a pyarrow dataset with all datapoints in ./data.

Datapoint Preparation (optional, but recommended)

Docker images can be prepared before they are used. This can be done with the provided script:

python usebench/scripts/prepare_uid.py ./data repocod_astropy__astropy-2 swe_django__django-10914

To prepare all images:

nohup sh -c 'xargs -a ./dev/all-ids.txt python3 usebench/scripts/prepare_uid.py ./data --workers=2 ' > preparation.log 2>&1 &
tail -f preparation.log

If images are not prepared, they will be built on demand. This works, but causes long startup times the first time a new uid is encountered.
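If you prefer to drive preparation from Python rather than xargs, a minimal sketch using only the standard library is shown below. It shells out to the same script; the CLI shape is taken from the examples above, and the uid list is illustrative:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

DATA = "./data"  # dataset directory produced by the migration step

def prepare(uid: str) -> int:
    # Build the docker image for one uid via the provided script
    # (CLI shape taken from the examples above; adjust if it differs).
    proc = subprocess.run(
        [sys.executable, "usebench/scripts/prepare_uid.py", DATA, uid]
    )
    return proc.returncode

uids = ["repocod_astropy__astropy-2", "swe_django__django-10914"]
with ThreadPoolExecutor(max_workers=2) as pool:  # mirrors --workers=2
    codes = list(pool.map(prepare, uids))
print(codes)  # a return code of 0 means the uid was prepared successfully
```

Thread-based parallelism is sufficient here because the work happens in the subprocesses, not in the Python interpreter.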

Evaluation

Predictions are supplied as a JSON file similar to the format specified in SWE-Verified; examples can be found in our test files.
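As a sketch of what such a file can look like, the snippet below writes a single prediction record using SWE-bench's field names (`instance_id`, `model_name_or_path`, `model_patch`). The model name and patch are placeholders; the authoritative examples are the repository's test files:

```python
import json
import os
import tempfile

# Field names follow SWE-bench's predictions format; check the test
# files in this repository for the exact shape USEbench expects.
predictions = [
    {
        "instance_id": "swe_django__django-10914",
        "model_name_or_path": "my-model",  # hypothetical model identifier
        "model_patch": (                   # placeholder unified diff
            "diff --git a/example.py b/example.py\n"
            "--- a/example.py\n"
            "+++ b/example.py\n"
        ),
    }
]

path = os.path.join(tempfile.mkdtemp(), "predictions.json")
with open(path, "w") as fh:
    json.dump(predictions, fh, indent=2)
print(path)
```

The resulting file path can then be passed to `evaluate.py` via `--predictions_path`.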

Migration has to be done first. To run a minimal example (a few datapoints from each harness):

cd usebench/evaluation
python evaluate.py --run_id quick --predictions_path ../test-predictions/small-set-of-gold-predictions.json

To run on the gold standard:

python evaluate.py --run_id gold --predictions_path gold

Using from Python

The intended way to use USEBench from Python is via usebench.api.runner, like this:

import usebench.api.runner as usebench_runner

uid = "swe_django__django-10914"
data_folder = "path/to/your/data/after/migration"
command = "echo 'hello world'"

result = usebench_runner.run_command(uid=uid, command=command, dataset_path_or_name=data_folder)
print(result)

The runner supports (public) commands for getting containers, resetting containers, applying patches, and running commands. While there are sub-benchmark runners (e.g. SWERunner), these are invoked by the central runner based on the UID. More elaborate examples can be found in the tests and in the tests of downstream benchmarks.

For inspection and access to fields from Python, consider using id_management to obtain a USEBenchEntry, which contains all relevant information. See Development.md for examples.

Other Scripts and Use-Cases

If you plan to develop with USE-Bench, or run into trouble, please look into the Resources directory.

There is an extra file with more command examples under Development.md.

License

This repository is under the MIT License.

Some dependencies follow a different licensing scheme, see attributions for a detailed overview.

Citation

This work is part of a larger effort towards software engineering agents; please cite:

@misc{applis2025unifiedsoftwareengineeringagent,
      title={Unified Software Engineering agent as AI Software Engineer}, 
      author={Leonhard Applis and Yuntong Zhang and Shanchao Liang and Nan Jiang and Lin Tan and Abhik Roychoudhury},
      year={2025},
      eprint={2506.14683},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2506.14683}, 
}
