Skip to content

microsoft/SWE-Bench-Mutated-CAIN26

SWE-Bench-Mutated

Official code release for Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation, accepted for publication at CAIN 2026.

This repo provides a CLI that rewrites SWE-Bench prompts using an LLM and saves a dataset which can be used downstream for agent inference.

Quick Start

git clone https://github.com/microsoft/swebench-mutate.git
cd swebench-mutate
make setup    # Creates venv, installs deps, prompts for Azure OpenAI credentials
make run      # Runs the example script

By default, make will setup the project, then run the example command. Run make help to see all available commands.

Installation

# Clone the repository
git clone https://github.com/microsoft/swebench-mutate.git
cd swebench-mutate

# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install the package
pip install -e .

# For development (includes linting and testing tools)
pip install -e ".[dev,test]"

Environment Setup

Before running the example script, configure your Azure OpenAI credentials:

# Copy the template and edit with your values
cp .env.template .env

# Edit .env with your Azure OpenAI endpoint and authentication

We use Azure OpenAI through LiteLLM by default. To use alternative LLM providers, see the LiteLLM docs.

Required environment variables:

  • AZURE_API_BASE: Your Azure OpenAI resource endpoint (e.g., https://your-resource.openai.azure.com/)
  • AZURE_API_VERSION: API version
  • AZURE_OPENAI_API_KEY: Your API key

Usage

script/run_example.sh runs the CLI on 5 prompts.

Run swebench-mutate --help for instructions on running the standalone Python CLI.

The default configuration is example.yaml. Additional LiteLLM arguments can be passed by configuring additional_args with litellm.completion arguments or through environment variables.

Prompts

See prompt_customization.py for the anonymized mutation prompts used to mutate SWE-Bench prompts.

Citation

GitHub Citation

If you use SWE-Bench-Mutated, please cite:

@misc{garg2025savingswebenchbenchmarkmutation,
      title={Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation}, 
      author={Spandan Garg and Benjamin Steenhoek and Yufan Huang},
      year={2025},
      eprint={2510.08996},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2510.08996}, 
}

Contributing

See CONTRIBUTING.md for guidelines.

License

This repository is released under the MIT License (see LICENSE).

Security

Security reporting information is in SECURITY.md.

Trademark Notice

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

About

This repo provides a CLI that rewrites SWE-Bench prompts using an LLM and saves a dataset which can be used downstream for agent inference.

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors