
Unified Software Engineering Benchmark (USE-Bench)

Welcome to USEbench, a meta-benchmark that unifies multiple software engineering benchmarks behind a single API. We currently support the following benchmarks:

| Benchmark | Task | Concept | Solving Criteria | # | Source |
|---|---|---|---|---|---|
| **Primitive Tasks** | | | | | |
| SWE-Verified | Program Repair | Adjust a program according to an issue. | Test-Suite Pass | 500 | SWE-Verified |
| SWT-bench-Lite | Regression Testing | Produce a test that asserts a required change, as presented in an issue. | Patch-Coverage | 298 | SWT |
| REPOCOD-Lite | Code Generation | Generate a method body matching the functionality outlined in documentation. | Test-Suite Pass | 200 | REPOCOD |
| REPOTEST-Lite | Test Generation | Test a given method to reach 100% coverage. | Code-Coverage | 173 | novel |
| **Compound Tasks** | | | | | |
| SWETRY | Test Generation, Program Repair | Enrich the SWE task with a previously failed but promising patch. | Test-Suite Pass | 100 | novel |
| REPOFEAT | Code & Test Generation | Add both tests and code; evaluate against the newly produced code coverage. | Code-Coverage | 46 | novel |

for a total of 1318 datapoints, each with a description and an executable (docker) environment. The benchmarks are intentionally chosen subsets of the larger bodies, mainly to keep computational demands and costs manageable. We currently do not support or host a leaderboard.

Scripts / Instructions

Installation + Setup

Requirements

  • Python >=3.12
  • Docker (we used 27.4)
  • 200 GB of disk space recommended for a full gold run

Installation

pip install -e .

Creating Dataset

cd usebench/migration
python migration.py

This will create a pyarrow dataset with all datapoints in ./data.

Datapoint Preparation (optional, but recommended)

Docker images can be prepared before they are used. This can be done with the provided script:

python usebench/scripts/prepare_uid.py ./data repocod_astropy__astropy-2 swe_django__django-10914

To prepare all images:

nohup sh -c 'xargs -a ./dev/all-ids.txt python3 usebench/scripts/prepare_uid.py ./data --workers=2 ' > preparation.log 2>&1 &
tail -f preparation.log

If images are not prepared, they will be built on demand. This works, but causes long startup times the first time a new uid is encountered.
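If you prefer to drive preparation from Python rather than xargs, a minimal sketch using only the standard library is shown below. It shells out to the same script; the CLI shape is taken from the examples above, and the uid list is illustrative:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

DATA = "./data"  # dataset directory produced by the migration step

def prepare(uid: str) -> int:
    # Build the docker image for one uid via the provided script
    # (CLI shape taken from the examples above; adjust if it differs).
    proc = subprocess.run(
        [sys.executable, "usebench/scripts/prepare_uid.py", DATA, uid]
    )
    return proc.returncode

uids = ["repocod_astropy__astropy-2", "swe_django__django-10914"]
with ThreadPoolExecutor(max_workers=2) as pool:  # mirrors --workers=2
    codes = list(pool.map(prepare, uids))
print(codes)  # a return code of 0 means the uid was prepared successfully
```

Thread-based parallelism is sufficient here because the work happens in the subprocesses, not in the Python interpreter.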

Evaluation

Predictions are supplied as a JSON file similar to the format specified in SWE-Verified; examples can be found in our test files.
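As a sketch of what such a file can look like, the snippet below writes a single prediction record using SWE-bench's field names (`instance_id`, `model_name_or_path`, `model_patch`). The model name and patch are placeholders; the authoritative examples are the repository's test files:

```python
import json
import os
import tempfile

# Field names follow SWE-bench's predictions format; check the test
# files in this repository for the exact shape USEbench expects.
predictions = [
    {
        "instance_id": "swe_django__django-10914",
        "model_name_or_path": "my-model",  # hypothetical model identifier
        "model_patch": (                   # placeholder unified diff
            "diff --git a/example.py b/example.py\n"
            "--- a/example.py\n"
            "+++ b/example.py\n"
        ),
    }
]

path = os.path.join(tempfile.mkdtemp(), "predictions.json")
with open(path, "w") as fh:
    json.dump(predictions, fh, indent=2)
print(path)
```

The resulting file path can then be passed to `evaluate.py` via `--predictions_path`.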

Migration has to be done first. To run a minimal example (a few datapoints from each harness):

cd usebench/evaluation
python evaluate.py --run_id quick --predictions_path ../test-predictions/small-set-of-gold-predictions.json

To run on the gold standard:

python evaluate.py --run_id gold --predictions_path gold

Using from Python

The intended way to use USEBench from Python is via usebench.api.runner, like this:

import usebench.api.runner as usebench_runner

uid = "swe_django__django-10914"
data_folder = "path/to/your/data/after/migration"
command = "echo 'hello world'"

result = usebench_runner.run_command(uid=uid, command=command, dataset_path_or_name=data_folder)
print(result)

The runner supports (public) commands for getting containers, resetting containers, applying patches, and running commands. While there are sub-benchmark runners (e.g. SWERunner), these are invoked by the central runner based on the UID. More elaborate examples can be found in the tests and in the tests of downstream benchmarks.

For inspection and access to fields from Python, consider using id_management to obtain a USEBenchEntry, which contains all relevant information. See Development.md for examples.

Other Scripts and Use-Cases

If you plan to develop with USE-Bench, or run into trouble, please look into the Resources directory.

There is an extra file with more command examples under Development.md.

License

This repository is under the MIT License.

Some dependencies follow a different licensing scheme, see attributions for a detailed overview.

Citation

This work is part of a larger effort towards software engineering agents; please cite:

@misc{applis2025unifiedsoftwareengineeringagent,
      title={Unified Software Engineering agent as AI Software Engineer}, 
      author={Leonhard Applis and Yuntong Zhang and Shanchao Liang and Nan Jiang and Lin Tan and Abhik Roychoudhury},
      year={2025},
      eprint={2506.14683},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2506.14683}, 
}
