This folder adds an offline-first, file-localization benchmark for mcp-brain using SWE-bench Lite.
The benchmark question is:
Given a real GitHub issue/problem statement, can mcp-brain rank the files that the accepted reference patch actually modified?
It measures file-level localization quality, not whether the system can generate the final code patch.
SWE-bench Lite provides 300 real Python bug-fix tasks, each with a repo, base_commit, problem_statement, and reference patch. The adapter extracts file-level ground truth from the reference patch.
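As an illustration of how file-level ground truth can be derived from a reference patch, here is a minimal sketch. It assumes the patch is a git-style unified diff with `diff --git a/... b/...` headers. The function names and the test-file heuristic are hypothetical and are not taken from the adapter, whose actual rules may differ.

```python
import re

# Matches the header git writes for each file in a unified diff,
# e.g. "diff --git a/pkg/core.py b/pkg/core.py".
_DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$", re.MULTILINE)


def modified_files(patch: str) -> list[str]:
    """Return the paths touched by a git-style patch, in order, without duplicates."""
    seen: dict[str, None] = {}
    for match in _DIFF_HEADER.finditer(patch):
        seen.setdefault(match.group(2), None)  # group 2 is the "b/" (post-change) path
    return list(seen)


def looks_like_test(path: str) -> bool:
    """Hypothetical heuristic for test files; the real adapter may use other rules."""
    parts = path.lower().split("/")
    name = parts[-1]
    return (
        any(part in ("test", "tests", "testing") for part in parts[:-1])
        or name.startswith("test_")
        or name.endswith("_test.py")
        or name == "conftest.py"
    )


def gold_files(patch: str, include_tests: bool = False) -> list[str]:
    """Ground-truth files for one task, optionally keeping test files."""
    files = modified_files(patch)
    if include_tests:
        return files
    return [path for path in files if not looks_like_test(path)]
```

With a patch that touches `pkg/core.py` and `tests/test_core.py`, `gold_files(patch)` would keep only `pkg/core.py`, while `gold_files(patch, include_tests=True)` would keep both.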
Network is required only for the one-time preparation steps:
pip install -e .
pip install -r benchmark/requirements-benchmark.txt
python -m benchmark.adapters.swebench_lite --output benchmark/datasets/cache/swebench_lite.jsonl
python -m benchmark.prepare_repos --dataset benchmark/datasets/cache/swebench_lite.jsonl --repo-cache benchmark/repos
After that, evaluation is offline:
python -m benchmark.run_eval \
--dataset benchmark/datasets/cache/swebench_lite.jsonl \
--repo-cache benchmark/repos \
--out benchmark/results/swebench_lite_results.json \
--report-dir benchmark/reports \
--top-k 10 \
--max-hops 2
For a quick smoke run:
python -m benchmark.run_eval --limit 5 --timeout 180
Outputs:
- benchmark/results/swebench_lite_results.json: complete machine-readable results
- benchmark/reports/swebench_lite_report.md: markdown report
- benchmark/reports/swebench_lite_report.html: readable HTML report
The harness reports:
- Hit@K: at least one gold file appears in the top-K predictions
- Recall@K: fraction of gold files recovered in the top-K predictions
- MAP@K: mean average precision at K
Default K values: 1, 3, 5, 10.
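The per-task metrics above can be sketched as follows. This is an illustrative sketch, not the harness's code: the function names are hypothetical, the ranked list is assumed to contain no duplicate paths, and average precision is normalized by `min(len(gold), k)`, which is one common convention that the harness may or may not use. MAP@K is then the mean of the per-task average precision.

```python
def hit_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """1.0 if any gold file is among the top-k predictions, else 0.0."""
    return 1.0 if any(path in gold for path in ranked[:k]) else 0.0


def recall_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold files that appear among the top-k predictions."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


def average_precision_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Average precision over the top-k predictions for a single task.

    Assumption: normalized by min(len(gold), k), so a perfect ranking scores 1.0
    even when there are more gold files than k.
    """
    if not gold:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, path in enumerate(ranked[:k], start=1):
        if path in gold:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / min(len(gold), k)


def mean_over_tasks(scores: list[float]) -> float:
    """Aggregate a per-task metric; applied to average precision this gives MAP@K."""
    return sum(scores) / len(scores) if scores else 0.0
```

For example, with predictions `["a.py", "b.py", "c.py", "d.py"]` and gold files `{"b.py", "d.py"}`, Hit@1 is 0.0, Hit@3 is 1.0, and Recall@3 is 0.5.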
By default the adapter excludes test files from ground truth. This makes the benchmark stricter for the product goal: predict the production files to modify from a ticket. To include tests too:
python -m benchmark.adapters.swebench_lite --include-tests
Notes:
- mcp-brain.json is intentionally not required and not used.
- The repo cache is not committed and can be large.
- --use-semantic is optional; it requires the semantic reranker dependencies/model cache to be available locally.