elizaOS's SWE-bench Lite harness. The canonical single-shot flow lives in
cli.py (entry: python -m benchmarks.swe_bench …). See RESEARCH.md
for design notes and historical results.
harness/comparison.py runs the same N SWE-bench Lite instances through
two paths and emits a side-by-side JSON report:
- Path A — elizaOS uses the existing canonical bridge (
cli._run_instance): prompt the TS bench server, extract a unified diff, grade withSWEBenchEvaluator. - Path B — opencode clones the target repo at
base_commitinto a per-instance sandbox, invokesopencode run "<task>"in that workdir, then captures the working-tree diff viagit diffand grades it with the sameSWEBenchEvaluator. Ifopencodeis not onPATH, each Path B record is markedstatus="skipped_opencode_missing"and the run continues.
Both paths share dataset loading, sandboxing, and grading so the only honest delta is the patch producer.
# Stub-only — emits the report schema with placeholder entries, no Docker,
# no eliza bridge, no opencode call. Use this to inspect the JSON shape.
python -m benchmarks.swe_bench.harness.comparison --n 2 --stub
# Real smoke (requires: docker, the eliza TS bench bridge, and opencode on PATH)
python -m benchmarks.swe_bench.harness.comparison --n 2
# Pin specific Lite instances
python -m benchmarks.swe_bench.harness.comparison \
--instances django__django-11099 sympy__sympy-20590comparison_<timestamp>.json (or comparison_smoke.json for --stub):
A pre-generated placeholder report lives at
harness/fixtures/comparison_smoke.json for downstream tooling that
wants to lock in the schema before the first real run.
{ "schema_version": 1, "generated_at": "<ISO-8601 UTC>", "totals": { "instances": 2, "elizaos_resolved": 0, "opencode_resolved": 0, "elizaos_wins": 0, "opencode_wins": 0, "ties_resolved": 0, "ties_failed": 2 }, "records": [ { "instance_id": "django__django-11099", "repo": "django/django", "base_commit": "<sha>", "path_a": { "path": "elizaos", "status": "resolved | failed | no_patch | error | not_run_yet", "patch": "<unified diff>", "resolved": false, "time_s": 0.0, "patch_status": "tests_passed | tests_failed | apply_failed | …", "tests_passed": [], "tests_failed": [], "error": null }, "path_b": { "path": "opencode", "...": "(same shape, plus status=skipped_opencode_missing)" }, "winner": "elizaos | opencode | tie_resolved | tie_failed" } ] }