# ArtEvalBench

`ArtEvalBench` is a benchmark for evaluating AI agents that support the Artifact Evaluation (AE) process by auditing the research prototypes (artifacts) that accompany research papers as part of the peer-review process. Artifact evaluation involves reconstructing a reference environment from (partial) specifications, building and configuring complex codebases with often implicit assumptions, preparing datasets and third-party benchmarks whose availability may change over time, orchestrating multi-stage experiments under controlled resource and time budgets, and validating that observed results fall within acceptable tolerance bounds relative to those reported in the paper. Despite the intricacy of the process, we believe AI agents can be trained to support reviewers by automating most of these stages.

Want to find out more or contribute? Jump to the [contributor's guide](#contributors-guide).

## Goals and Objectives

Artifact evaluation has become a standard component of the peer-review process across a wide range of conferences in Computer Science, especially in Systems and related areas. Despite this progress, however, the practical work of provisioning operational environments, resolving dependencies, building artifacts, preparing benchmarks, running experiments, and checking results remains brittle and time-consuming. To alleviate this burden, we envision an automated artifact evaluation AI assistant that executes repeatable steps under (human) reviewer supervision. This "AE assistant" would target artifact mechanics (e.g., code compilation, dataset/benchmark preparation, experiment orchestration, and output validation) alongside code auditing (e.g., does the artifact implementation match the paper's prose? do the results closely match those in the paper?). The agent's output can then inform the more complex methodological assessment, design trade-off analysis, and results interpretation that reviewers need to perform to complete the AE process.

Concretely, given an artifact (code, documentation, experiment framework), a complete installation & operation guide, and the paper itself, the AE assistant:

1. provisions the reference environment;

2. builds/installs a particular version of the artifact using the specified toolchain;

3. retrieves and prepares datasets or other third-party targets;

4. orchestrates experiments with explicit configuration, time, and resource budgets; and

5. generates a human-readable report (sketched below) that summarizes the outcome of each step, indicating any blockers (e.g., missing dependencies) and how it managed to overcome them.

The goal is to reduce reviewer effort on mechanical tasks so attention can shift to scientific auditing.
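
For illustration, a run over one artifact might conclude with a report along the following lines (this is a sketch, not a prescribed format; the stage outcomes and messages are hypothetical):

```text
[1/4] Environment setup ...... PASS    (Ubuntu 22.04, gcc 11.4, 1 GPU detected)
[2/4] Build/install .......... PASS    (blocker: missing libnuma-dev; resolved via apt)
[3/4] Benchmark preparation .. PASS    (datasets downloaded, checksums verified)
[4/4] Experiment runs ........ PARTIAL (2/3 experiments within 5% of reported results; exp-3 timed out)
```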

## Background

#### » The artifact evaluation process

Most conferences award badges to incentivize high-quality artifacts that support the paper's claims by asking authors to participate in a multi-stage evaluation process where reviewers attempt to download, install, and operate the artifacts themselves. The following summarizes the widely used criteria for each badge:

* Artifact Available. This badge indicates that the artifact itself (code, documentation, scripts, benchmarks, etc.) is publicly accessible with a persistent identifier (e.g., DOI, commit ID) on an (ideally, long-term) archival repository (e.g., Zenodo, GitHub). Availability does not imply that the artifact compiles, builds, or is functionally correct. It only confirms that the materials needed to verify key claims, reproduce experimental results, and reuse the tool itself are open-sourced.

* Artifact Functional. This badge indicates that the artifact installs/builds in a reference environment and runs at least a subset of the documented experiments. It confirms that dependencies and configurations are explicitly recorded, and that the outputs, at least for said subset of experiments, are consistent with the paper's prose.

* Results Reproduced. This badge indicates that a third party can re-execute all necessary experiments and obtain results consistent with the paper, within a reasonable tolerance (e.g., relative error bounds, confidence intervals, or rank-ordering equivalence). On top of re-obtaining results that support the paper's claims, reproducibility further requires verifiable provenance (e.g., SW/HW environment characteristics, configuration parameters, experiment logs) and principled handling of non-determinism (e.g., repeated trials, fixed initial states, or variance analysis).

Further reading and a detailed description of the criteria for each badge can be found [here](https://sysartifacts.github.io/eurosys2026/badges) and [here](https://sysartifacts.github.io/evaluator-guide.html).

#### » What makes AE challenging in practice?

Reproducibility and reusability can be obstructed by multiple factors including, but not limited to: (i) environment drift (e.g., legacy libraries that are no longer available, driver mismatches in newer OS versions); (ii) undocumented or implicit build assumptions (e.g., hard-coded compiler flags, directory paths, IPs, or reliance on OS-wide libraries that differ across distributions); (iii) brittle preprocessing of third-party benchmarks or datasets (e.g., broken download URLs, non-deterministic compilation steps that silently invalidate subsequent stages); and (iv) unspecified result tolerance bounds that complicate validation of non-deterministic experiments (e.g., performance claims without clarifying what constitutes an acceptable deviation when running within a similar SW/HW setup).

Overcoming such challenges requires persistence and careful bookkeeping, precisely where an automated AE assistant can provide leverage.

## Contributor's guide

#### » Overview and high-level structure

To train and improve AE agents in a principled way, we introduce `ArtEvalBench`, a curated collection of artifacts accompanying peer-reviewed papers. To ensure a fair comparison, we include artifacts that have already been evaluated in an official AE process and awarded all three badges by the committee. Each entry includes the original artifact (instructions, code, scripts, datasets/benchmarks, etc.), the original paper, and a collection of "oracle" scripts that define objective checkpoints at four canonical stages: environment setup, build/install, benchmark preparation, and experiment execution.

`ArtEvalBench` is designed to evaluate agents on capability (which stages they complete), efficiency (wall-clock time and intervention count), and fidelity (how closely reproduced results match those reported).

To check those capabilities, each artifact includes four oracle scripts that encode minimal, verifiable success criteria for each of the four stages. The oracles are invoked non-interactively and must be idempotent. Conceptually, these four stages correspond to:

1. Environment Setup: verifies the presence and versions of required tools, libraries, or other dependencies; confirms hardware availability when applicable; and checks that configurations are portable rather than hardcoded or tied to a specific machine.
2. Build/Install: confirms a complete build (or install) of a specified version, with the expected binaries/modules present; runs tests, when available, or simple validation commands such as invoking `--help` or equivalent.
3. Benchmark Preparation: asserts that datasets/benchmarks are present and checksums match; verifies that the necessary third-party tools compile and that the artifact's instrumentation/monitoring hooks are enabled, if applicable.
4. Experiment Runs: executes each experiment according to the authors' guidelines; checks that the artifact produces the expected metrics, logs, files, figures, etc.; and provides an initial assessment relative to the specified tolerance bounds.

For a typical example, check out the [agent evaluator](data/benchmark/sosp24_wasabi/wasabi/_agent_eval/) of [WASABI](data/benchmark/sosp24_wasabi/wasabi/).
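
To make this structure concrete, below is a minimal, self-contained sketch of such an orchestrator. The stage checks, file paths, and reported value are hypothetical placeholders; in a real artifact each oracle lives in its own Python file, and WASABI's evaluator remains the authoritative reference.

```python
#!/usr/bin/env python3
"""Illustrative _agent_eval/main.py sketch; names, paths, and values are hypothetical."""
import shutil
from pathlib import Path


def check_environment() -> bool:
    """Stage 1: required toolchain is present (illustrative checks only)."""
    return shutil.which("gcc") is not None and shutil.which("make") is not None


def check_build() -> bool:
    """Stage 2: the expected binary exists after the build (hypothetical path)."""
    return Path("build/artifact-binary").exists()


def check_benchmarks() -> bool:
    """Stage 3: datasets/benchmarks are in place (hypothetical path)."""
    return Path("data/inputs").is_dir()


def check_experiments(reported: float = 42.0, tolerance: float = 0.05) -> bool:
    """Stage 4: a reproduced metric is within tolerance of the reported value."""
    results = Path("results/metric.txt")  # hypothetical output file
    if not results.exists():
        return False
    measured = float(results.read_text().strip())
    return abs(measured - reported) / reported <= tolerance


def main() -> int:
    """Run the four stage oracles in order and return the overall score (0..4)."""
    checks = [check_environment, check_build, check_benchmarks, check_experiments]
    score = 0
    for check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing oracle counts as a failed stage
        print(f"{check.__name__}: {'PASS' if ok else 'FAIL'}")
        score += int(ok)
    return score


if __name__ == "__main__":
    print(main())
```

Each check runs non-interactively and can be re-executed without side effects, matching the oracle requirements above.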

#### » Adding a new artifact

Adding a new artifact to the benchmark requires several steps:

1. Create a stand-alone directory in `./data/benchmark` and copy all artifact files, including the README file.
2. Implement oracles for evaluating the AI agent. They should follow the same structure as WASABI's [evaluator](data/benchmark/sosp24_wasabi/wasabi/_agent_eval/), where each oracle is implemented in a separate Python source file and orchestrated by a `main.py` whose `main()` method returns a single integer: the overall score (0..4) the agent achieved.
3. Create an entry in the [task journal](data/benchmark/arteval_tasks.jsonl) and populate the appropriate fields (see the example entry after this list).
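
The task journal is a JSON Lines file with one task per line. For illustration only, an entry might look like the sketch below; the field names here are hypothetical, so mirror the fields used by the existing entries in `arteval_tasks.jsonl`.

```json
{"task_id": "sosp24_wasabi", "paper": "WASABI (SOSP 2024)", "artifact_dir": "data/benchmark/sosp24_wasabi/wasabi", "evaluator": "data/benchmark/sosp24_wasabi/wasabi/_agent_eval/main.py", "max_score": 4}
```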

## Benchmark Setup

#### » Install dependencies

To install the benchmark, simply run the `install.sh` script to set up the environment:

```sh
./install.sh
```

This operation will:
* Set up a Python 3.12 virtual environment
* Clone and install SWE-agent
* Install the required Python packages (pytest, pytest-cov)
* Clone course repositories (6.5840-golabs-2024, xv6-labs-2024, etc.)

#### » Run the benchmark

To run the benchmark:

1. Configure your LLM endpoint in `env.toml` (see the example configuration after this list):
   * For Azure/OpenAI models: set `AZURE_API_KEY`, `AZURE_API_BASE`, and `AZURE_API_VERSION`
   * For Anthropic models: set `ANTHROPIC_API_KEY`
   * For self-hosted models: configure `OPENAI_API_TYPE` and `OPENAI_BASE_URL`

2. Execute the `run.sh` script with your model:

   ```sh
   ./run.sh <model_name>
   # Example: ./run.sh claude-sonnet-4-5-20250929
   ```

3. Results will be saved to `outputs/`, tagged with a timestamp and model information.
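
The exact layout of `env.toml` may differ, so treat the following as a sketch only (the keys match the list above; the values are placeholders) and check the file shipped with the repository:

```toml
# env.toml -- illustrative sketch; values are placeholders
ANTHROPIC_API_KEY = "sk-ant-..."

# Azure/OpenAI alternative:
# AZURE_API_KEY = "..."
# AZURE_API_BASE = "https://<your-endpoint>.openai.azure.com"
# AZURE_API_VERSION = "2024-02-01"

# Self-hosted alternative:
# OPENAI_API_TYPE = "openai"
# OPENAI_BASE_URL = "http://localhost:8000/v1"
```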

#### » Supported Agents

The benchmark supports multiple AI agents:
* **Claude Code**: Anthropic's code assistant
* **Mini SWE Agent**: The compact version of the [SWE-agent](https://github.com/SWE-agent) assistant
* **OpenHands**: Open-source coding agent

To add your own agent to the benchmark, see [add_agents.md](add_agents.md).
0 commit comments