TIGER-AI-Lab/SWE-QA-Pro

SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding

📖 arXiv | 🤗 SWE-QA-Pro Bench | 💻 GitHub


📢 News

  • 🚀 [2026-05-19] The evaluation code is released — see eval/ for setup and reproduction instructions. Training code and model checkpoints will be released soon.
  • 🎉 [2026-04-07] Our paper has been accepted to the Findings of ACL 2026!
  • 🔥 [2026-03-20] SWE-QA-Pro Bench is publicly released! See our paper and benchmark.

📘 Introduction

SWE-QA-Pro is a benchmark and training framework for agentic repository-level code understanding, enabling models to explore, reason over, and verify real-world codebases. This work targets key limitations in existing evaluations:

  • Limited diversity: benchmarks focus on popular repositories, missing long-tail software tasks
  • Knowledge leakage: many questions can be solved without interacting with the codebase
  • Weak tool necessity: unclear whether agentic workflows are actually required

To address these challenges, we introduce two key components:

  1. SWE-QA-Pro Benchmark:

    A repository-level QA benchmark built from diverse long-tail repositories with executable environments.

    • Questions are seeded from real issues and grouped via clustering to ensure topic diversity
    • Each item is grounded in actual code with human verification
    • A difficulty calibration pipeline filters out questions solvable by direct-answer models

    This results in a benchmark where agentic exploration is necessary, with up to a ~13-point performance gap between tool-using agents and direct answering.

  2. Agentic Training Pipeline & Models:

    A scalable framework for learning repository-level agentic reasoning.

    • Generates synthetic tool-use trajectories and grounded supervision
    • Trains models with a two-stage recipe (SFT → RLAIF)
    • Enables small open models to learn multi-step reasoning, tool usage, and code navigation

    Models trained with this pipeline achieve strong performance: our SWE-QA-Pro 8B model surpasses GPT-4o by 2.3 points on SWE-QA-Pro and substantially narrows the gap to state-of-the-art proprietary models.
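The difficulty calibration step described for the benchmark can be illustrated with a minimal sketch. This is not the project's pipeline: the `Question` fields, the 0-10 score scale, the `max_direct_score` threshold, and the toy scores are all assumptions made up for illustration. The idea is only that questions a direct-answer model already solves are dropped, and the agent-versus-direct gap is measured on what remains.

```python
from dataclasses import dataclass


@dataclass
class Question:
    qid: str
    direct_score: float  # judge score without tools (assumed 0-10 scale)
    agent_score: float   # judge score with tool use (same assumed scale)


def calibrate(questions, max_direct_score=6.0):
    """Keep only questions that direct answering does not already solve.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    return [q for q in questions if q.direct_score < max_direct_score]


def mean_gap(questions):
    """Average agent-minus-direct score; 0.0 for an empty list."""
    if not questions:
        return 0.0
    return sum(q.agent_score - q.direct_score for q in questions) / len(questions)


# Toy data, invented for this example only.
pool = [
    Question("q1", direct_score=9.0, agent_score=9.0),  # solvable directly
    Question("q2", direct_score=3.0, agent_score=8.0),
    Question("q3", direct_score=5.0, agent_score=7.0),
]

kept = calibrate(pool)
print([q.qid for q in kept])  # ['q2', 'q3']
print(mean_gap(kept))         # 3.5
```

Filtering on the direct-answer score is what makes the remaining gap meaningful: a question that scores well without tools says little about whether agentic exploration is needed.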

🛠️ TODO

  • [x] Release the dataset
  • [x] Release the evaluation code
  • [ ] Release the model
  • [ ] Release the training code

⚙️ Train & Eval

  • Evaluation: code for running both Direct mode and Agent mode on the SWE-QA-Pro Bench (with a unified OpenAI/DeepSeek judge) lives under eval/. The runner supports GPT-4o, GPT-4.1, Claude Sonnet 4.5, Gemini 2.5 Pro, DeepSeek v4, and local vLLM models (Qwen3, Devstral, Llama 3.3). Backends are configured declaratively in eval/configs/models.yaml, so adding or swapping a model requires no code changes. See eval/README.md for installation, supported models, run commands, and the package layout.
  • Training: coming soon.
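To show why a declarative model registry avoids code changes, here is a rough sketch of the pattern. It does not reproduce the schema of eval/configs/models.yaml, which is documented in eval/README.md. The keys (`backend`, `model`, `base_url`), the entry names, and the stub functions are hypothetical, and a plain dict stands in for the parsed YAML so the example needs no third-party packages.

```python
# Stand-in for a parsed config file. Every key and value here is hypothetical.
MODELS = {
    "gpt-4o": {"backend": "openai", "model": "gpt-4o"},
    "local-qwen": {
        "backend": "vllm",
        "model": "my-local-model",               # placeholder model id
        "base_url": "http://localhost:8000/v1",  # placeholder endpoint
    },
}


def call_openai(cfg, prompt):
    # Stub: a real backend would send the prompt to a hosted API.
    return f"[openai:{cfg['model']}] {prompt}"


def call_vllm(cfg, prompt):
    # Stub: a real backend would send the prompt to a local server.
    return f"[vllm:{cfg['model']}@{cfg['base_url']}] {prompt}"


# One handler per backend type; adding a model only touches the config.
BACKENDS = {"openai": call_openai, "vllm": call_vllm}


def run(model_name, prompt, models=MODELS):
    """Look up the model's config entry and dispatch to its backend."""
    cfg = models[model_name]
    return BACKENDS[cfg["backend"]](cfg, prompt)


print(run("gpt-4o", "Where is the retry logic defined?"))
```

With this layout, swapping in another model served by an existing backend means adding one config entry. The dispatch code stays unchanged.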

📬 Contact


📖 Citation

BibTeX:

@article{cai2026sweqapro,
      title={SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding}, 
      author={Songcheng Cai and Zhiheng Lyu and Yuansheng Ni and Xiangchao Chen and Baichuan Zhou and Shenzhe Zhu and Yi Lu and Haozhe Wang and Chi Ruan and Benjamin Schneider and Weixu Zhang and Xiang Li and Andy Zheng and Yuyu Zhang and Ping Nie and Wenhu Chen},
      journal={arXiv preprint arXiv:2603.16124},
      year={2026},
}

🙏 Acknowledgement

We thank the authors of SWE-QA-Bench for open-sourcing their codebase, which served as a useful implementation reference for this project.
