# USEBench

Welcome to USEBench, a meta-benchmark unifying multiple software engineering benchmarks behind a neat API.
We currently support the following benchmarks:
| Benchmark | Task | Concept | Solving Criteria | # Datapoints | Source |
|---|---|---|---|---|---|
| **Primitive Tasks** | | | | | |
| SWE-Verified | Program Repair | Adjust a program according to an issue. | Test-Suite Pass | 500 | SWE-Verified |
| SWT-bench-Lite | Regression Testing | Produce a test that asserts a required change, as presented in an issue. | Patch-Coverage | 298 | SWT |
| REPOCOD-Lite | Code Generation | Generate a method body, matching the functionality outlined in documentation. | Test-Suite Pass | 200 | REPOCOD |
| REPOTEST-Lite | Test Generation | Test a given method to reach 100% coverage. | Code-Coverage | 173 | novel |
| **Compound Tasks** | | | | | |
| SWETRY | Test Generation, Program Repair | Enrich the SWE-Task with a previously failed, but promising patch. | Test-Suite Pass | 100 | novel |
| REPOFEAT | Code & Test Generation | Add both tests and code, evaluate against the newly produced code coverage. | Code-Coverage | 46 | novel |
This gives a total of 1,317 datapoints, each with a description and an executable (Docker) environment. The benchmarks are intentionally chosen subsets of the larger bodies, mostly due to computational limitations and costs. We currently do not support/host a leaderboard.
## Requirements

- Python >= 3.12
- Docker (we used 27.4)
- 200 GB of disk space recommended for a full gold run
## Installation

```shell
pip install -e .
```

## Creating the Dataset

```shell
cd usebench/migration
python migration.py
```

This will create a pyarrow dataset with all datapoints in `./data`.
## Datapoint Preparation (Optional, but Recommended)

Docker images can be prepared before use with a provided script:

```shell
python usebench/scripts/prepare_uid.py ./data repocod_astropy__astropy-2 swe_django__django-10914
```

To prepare all images:

```shell
nohup sh -c 'xargs -a ./dev/all-ids.txt python3 usebench/scripts/prepare_uid.py ./data --workers=2' > preparation.log 2>&1 &
tail -f preparation.log
```

If images are not prepared, they will be built on demand. This works, but causes high startup times when a new uid is first encountered.
Prediction requires a JSON file similar to the one specified in SWE-Verified; examples can be found, e.g., in our test files.
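As a sketch, a predictions file in this style can be produced as follows. The field names (`instance_id`, `model_name_or_path`, `model_patch`) mirror the public SWE-bench predictions format; check the SWE-Verified documentation and our test files for the authoritative schema.

```python
import json

# One prediction entry per datapoint. The model name and patch below are
# placeholders, not real outputs.
predictions = [
    {
        "instance_id": "swe_django__django-10914",
        "model_name_or_path": "my-model",  # hypothetical model identifier
        "model_patch": "diff --git a/setup.py b/setup.py\n--- a/setup.py\n+++ b/setup.py\n",
    }
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```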
To run a minimal example (a few datapoints from each harness; migration has to be done first):

```shell
cd usebench/evaluation
python evaluate.py --run_id quick --predictions_path ../test-predictions/small-set-of-gold-predictions.json
```

To run on the gold standard:

```shell
python evaluate.py --run_id gold --predictions_path gold
```
The intended way to use USEBench is via `usebench.api.runner`, like this:

```python
import usebench.api.runner as usebench_runner

uid = "swe_django__django-10914"
data_folder = "path/to/your/data/after/migration"
command = "echo 'hello world'"
result = usebench_runner.run_command(uid=uid, command=command, dataset_path_or_name=data_folder)
print(result)
```

The runner supports (public) commands for getting containers, resetting containers, applying patches, and running commands. While there are sub-benchmark runners (e.g. `SWERunner`), these are invoked by the central runner based on the UID. You can find more elaborate examples in the tests and in the tests for downstream benchmarks.
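The UID examples in this README ("swe_django__django-10914", "repocod_astropy__astropy-2") start with a sub-benchmark prefix, which is what the central runner dispatches on. The helper below is a hypothetical illustration of that naming scheme, not part of the usebench API; the real resolution lives in usebench (see `id_management`).

```python
def benchmark_prefix(uid: str) -> str:
    """Illustrative only: return the sub-benchmark prefix encoded in a UID.

    Assumes the <benchmark>_<instance_id> pattern seen in this README;
    usebench's own id_management is the authoritative implementation.
    """
    return uid.split("_", 1)[0]

print(benchmark_prefix("swe_django__django-10914"))    # swe
print(benchmark_prefix("repocod_astropy__astropy-2"))  # repocod
```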
For inspection and access to fields from Python, you can use `id_management` to obtain a `USEBenchEntry` containing all relevant information. See Development.md for examples of this.

If you plan on developing with USEBench, or run into trouble, please look into the Resources directory. There is an extra file with more command examples under Development.md.
This repository is under the MIT License.
Some dependencies follow a different licensing scheme; see the attributions for a detailed overview.
This work is part of a larger effort towards software engineering agents. Please cite:

```bibtex
@misc{applis2025unifiedsoftwareengineeringagent,
  title={Unified Software Engineering agent as AI Software Engineer},
  author={Leonhard Applis and Yuntong Zhang and Shanchao Liang and Nan Jiang and Lin Tan and Abhik Roychoudhury},
  year={2025},
  eprint={2506.14683},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2506.14683},
}
```