Commit 4bfbb8a

Merge pull request #15 from sys-intelligence/arteval_benchmark
Adding ArtEvalBench v0.9
2 parents 51c9588 + f6fee8b commit 4bfbb8a

539 files changed: +37921 −131 lines changed


README.md

Lines changed: 1 addition & 0 deletions
@@ -146,3 +146,4 @@ trademarks or logos is subject to and must follow
 [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
 Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
 Any use of third-party trademarks or logos are subject to those third-party's policies.
+
Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
+# ----------------------
+# General
+# ----------------------
+*.lock
+*.log
+*.bak
+*.pkl
+*.png
+*.jpg
+*.jpeg
+*.pdf
+*.xls
+*.csv
+*.doc
+
+# Logs / temp
+logs/
+log/
+*.tmp
+*.temp
+*.swp
+*.swo
+*.orig
+
+# Trash / scratch areas
+__pycache__/
+trash/
+
+# OS files
+.DS_Store
+Thumbs.db
+
+# ----------------------
+# Python
+# ----------------------
+
+# Byte-compiled / optimized / DLL files
+*.py[cod]
+*$py.class
+
+# Virtual environments
+.venv/
+venv/
+env/
+ENV/
+
+# Distribution / packaging
+build/
+dist/
+eggs/
+*.egg-info/
+.eggs/
+pip-wheel-metadata/
+*.whl
+
+# Test / coverage
+.pytest_cache/
+.coverage
+.coverage.*
+htmlcov/
+.tox/
+.nox/
+
+# Type checking / tooling
+.mypy_cache/
+.pyre/
+.pytype/
+
+# ----------------------
+# Java
+# ----------------------
+
+# Compiled files
+*.class
+
+# Build outputs
+target/
+bin/
+out/
+
+# Maven / Gradle
+.mvn/
+.settings/
+.gradle/
+build/
+
+# IDE project files
+*.iml
+.idea/
+.project
+.classpath
+
+# Archives
+*.jar
+*.war
+*.ear
+
+# ----------------------
+# C / C++
+# ----------------------
+
+# Object / compiled files
+*.o
+*.obj
+*.so
+*.dll
+*.dylib
+*.a
+*.lib
+*.lo
+
+# Executables
+a.out
+*.exe
+*.out
+
+# Build directories
+build/
+cmake-build-*/

benchmarks/arteval_bench/README.md

Lines changed: 87 additions & 37 deletions
@@ -1,61 +1,111 @@
-# YourBenchmarkName
+# ArtEvalBench

-## Scenario Description
+`ArtEvalBench` is a benchmark for evaluating AI agents that support the Artifact Evaluation (AE) process by auditing the research prototypes (artifacts) that accompany research papers as part of peer review. Artifact evaluation involves reconstructing a reference environment from (partial) specifications, building and configuring complex codebases with often implicit assumptions, preparing datasets and third-party benchmarks whose availability may change over time, orchestrating multi-stage experiments under controlled resource and time budgets, and validating that observed results fall within acceptable tolerance bounds relative to those reported in the paper. Despite the intricacy of the process, we believe AI agents can be trained to support reviewers by automating most of these stages.

-Provide a summary of your scenarios here. This section should give an overview of the context, objectives, and key elements involved in your scenarios.
+Want to find out more or contribute? Jump to the [contributor's guide](#contributors-guide).

-### Task Details
+## Goals and Objectives

-Describe your task in detail, including:
+Artifact evaluation has become a standard component of the peer-review process across a wide range of conferences in Computer Science, especially in Systems and related areas. Despite this progress, however, the practical work of provisioning operational environments, resolving dependencies, building artifacts, preparing benchmarks, running experiments, and checking results remains brittle and time-consuming. To alleviate this burden, we envision an automated artifact evaluation AI assistant that executes repeatable steps under (human) reviewer supervision. This "AE assistant" would target artifact mechanics (e.g., code compilation, dataset/benchmark preparation, experiment orchestration, and output validation) alongside code auditing (e.g., does the artifact implementation match the paper's prose? do the results closely match those reported in the paper?). The agent's output can then inform the more complex methodological assessment, design trade-off analysis, and results interpretation that reviewers need to perform to complete the AE process.

-- **Input**: Specify the type of input data required for the task.
-- **Output**: Define the expected output from the task.
-- **Evaluation**: Explain how to evaluate the output, including any metrics or criteria used to measure performance.
+Concretely, given an artifact (code, documentation, experiment framework), a complete installation & operation guide, and the paper itself, the AE assistant:

-## Benchmark Setup
+1. provisions the reference environment;
+
+2. builds/installs a particular version of the artifact using the specified toolchain;
+
+3. retrieves and prepares datasets or other third-party targets;
+
+4. orchestrates experiments with explicit configuration, time, and resource budgets; and
+
+5. generates a human-readable report that summarizes the outcome of each step, indicating any blockers (e.g., missing dependencies) and how it managed to overcome them.
+
+The goal is to reduce reviewer effort on mechanical tasks so attention can shift to scientific auditing.
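
To make step 5 concrete, the per-stage summary such a report might contain could be captured with a small record type along the following lines. This is a hypothetical sketch; every name in it is illustrative rather than part of the benchmark:

```python
from dataclasses import dataclass, field

@dataclass
class StageReport:
    """Hypothetical per-stage summary an AE assistant could emit (illustration only)."""
    stage: str                                             # e.g., "environment-setup", "build", "benchmark-prep", "experiments"
    succeeded: bool                                         # did the stage meet its success criteria?
    blockers: list[str] = field(default_factory=list)       # e.g., missing dependencies that were encountered
    resolutions: list[str] = field(default_factory=list)    # how each blocker was overcome
    notes: str = ""                                         # free-form context for the human reviewer

# Hypothetical usage:
report = [StageReport(stage="environment-setup", succeeded=True,
                      blockers=["libfoo headers missing"],
                      resolutions=["installed distribution package"])]
```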
+
+## Background
+
+#### » The artifact evaluation process
+
+Most conferences award badges to incentivize high-quality artifacts that support the paper's claims: authors are asked to participate in a multi-stage evaluation process in which reviewers attempt to download, install, and operate the artifacts themselves. The following summarizes the widely used criteria for each badge:
+
+* Artifact Available. This badge indicates that the artifact itself (code, documentation, scripts, benchmarks, etc.) is publicly accessible under a persistent identifier (e.g., DOI, commit ID) on an (ideally long-term) archival repository (e.g., Zenodo, GitHub). Availability does not imply that the artifact compiles, builds, or is functionally correct. It only confirms that the materials needed to verify key claims, reproduce experimental results, and reuse the tool itself are open-sourced.
+
+* Artifact Functional. This badge indicates that the artifact installs/builds in a reference environment and runs at least a subset of the documented experiments. It confirms that dependencies and configurations are explicitly recorded and that outputs, at least for that subset of experiments, are consistent with the paper's prose.
+
+* Results Reproduced. This badge indicates that a third party can re-execute all necessary experiments and obtain results consistent with the paper, within a reasonable degree of tolerance (e.g., within relative error bounds, confidence intervals, or rank-ordering equivalence). Beyond re-obtaining results that support the paper's claims, reproducibility further requires verifiable provenance (e.g., SW/HW environment characteristics, configuration parameters, experiment logs) and principled handling of non-determinism (e.g., repeated trials, fixed initial states, or variance analysis). A sketch of a simple tolerance check follows this list.
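
To make the tolerance criterion concrete, a reproduced metric can be checked against a reported value with a simple relative-error bound, as in the sketch below. The helper and the 10% threshold are illustrative, not something the benchmark prescribes:

```python
def within_relative_error(reported: float, reproduced: float, tolerance: float = 0.10) -> bool:
    """Return True if `reproduced` is within `tolerance` relative error of `reported`
    (e.g., 0.10 means a 10% bound). Illustrative helper; the actual bounds are
    artifact-specific and should come from the paper or the AE instructions."""
    if reported == 0:
        return abs(reproduced) <= tolerance
    return abs(reproduced - reported) / abs(reported) <= tolerance

# Hypothetical example: the paper reports a 12.4 s mean runtime, and a re-run measures 13.1 s.
assert within_relative_error(12.4, 13.1, tolerance=0.10)   # |13.1 - 12.4| / 12.4 is roughly 5.6%
```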

-### Test in Docker
+Further reading and a detailed description of the criteria for each badge can be found [here](https://sysartifacts.github.io/eurosys2026/badges) and [here](https://sysartifacts.github.io/evaluator-guide.html).

-To test your benchmark in a Docker container, follow these steps:
+#### » What makes AE challenging in practice?

-1. Build the Docker image using the provided Dockerfile. You can do this by running the following command in the terminal:
+Reproducibility and reusability can be obstructed by multiple factors, including but not limited to: (i) environment drift (e.g., legacy libraries no longer available, driver mismatches in newer OS versions); (ii) undocumented or implicit build assumptions (e.g., hard-coded compiler flags, directory paths, IPs, or reliance on OS-wide libraries that differ across distributions); (iii) brittle preprocessing of third-party benchmarks or datasets (e.g., broken download URLs, non-deterministic compilation steps that silently invalidate subsequent stages); and (iv) unspecified results tolerance bounds that complicate validation for non-deterministic experiments (e.g., performance claims without clarifying what constitutes an acceptable deviation when running within a similar SW/HW setup).

-```sh
-docker build -t your_benchmark_image .
-```
+Overcoming such challenges requires persistence and careful bookkeeping, precisely where an automated AE assistant can provide leverage.

-2. Once the image is built, you can run it using the following command:
+## Contributor's guide
+
+#### » Overview and high-level structure
+
+To train and improve AE agents in a principled way, we introduce `ArtEvalBench`, a curated collection of artifacts accompanying peer-reviewed papers. To ensure a fair comparison, we include only artifacts that have already been evaluated in an official AE process and awarded all three badges by the committee. Each entry includes the original artifact (instructions, code, scripts, datasets/benchmarks, etc.), the original paper, and a collection of "oracle" scripts that define objective checkpoints at four canonical stages: environment setup, build/install, benchmark preparation, and experiment execution.
+
+`ArtEvalBench` is designed to evaluate agents on capability (which stages they complete), efficiency (wall-clock time and intervention count), and fidelity (how closely reproduced results match those reported).
+
+To check those capabilities, each artifact includes four oracle scripts that encode minimal, verifiable success criteria for each of the four stages. The oracles are invoked non-interactively and must be idempotent. Conceptually, these four stages correspond to:
+
+1. Environment Setup: verifies the presence and versions of required tools, libraries, or other dependencies; confirms hardware availability when applicable; and checks that configurations are portable rather than hardcoded or tied to a specific machine.
+2. Build/Install: confirms a complete build (or install) from a specified version, with the expected binaries/modules present; runs tests, when available, or simple validation commands such as invoking `--help` or equivalent.
+3. Benchmark Preparation: asserts that datasets/benchmarks are present and checksums match; verifies that necessary third-party tools compile and that the artifact's instrumentation/monitoring hooks are enabled, if applicable.
+4. Experiment Runs: executes each experiment according to the authors' guidelines; checks that the artifact produces the expected metrics, logs, files, figures, etc.; and provides an initial assessment relative to specified tolerance bounds.
+
+For a typical example, check out the [agent evaluator](data/benchmark/sosp24_wasabi/wasabi/_agent_eval/) of [WASABI](data/benchmark/sosp24_wasabi/wasabi/).
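
For intuition, a stage-1 environment-setup oracle can reduce to a short, non-interactive, idempotent script along the following lines. This is a sketch only; the tool list is a placeholder rather than WASABI's actual requirements:

```python
"""Illustrative stage-1 (environment setup) oracle: verify that required tools resolve on PATH.
The REQUIRED_TOOLS list is hypothetical; a real oracle encodes the artifact's documented environment."""
import shutil
import subprocess

REQUIRED_TOOLS = ["java", "mvn", "git"]   # placeholder toolchain

def check() -> bool:
    """Return True iff every required tool is found and reports a version (idempotent, non-interactive)."""
    ok = True
    for tool in REQUIRED_TOOLS:
        if shutil.which(tool) is None:
            print(f"[FAIL] {tool}: not found on PATH")
            ok = False
            continue
        out = subprocess.run([tool, "--version"], capture_output=True, text=True)
        banner_lines = (out.stdout or out.stderr).strip().splitlines()
        print(f"[OK] {tool}: {banner_lines[0] if banner_lines else 'version string unavailable'}")
    return ok

if __name__ == "__main__":
    raise SystemExit(0 if check() else 1)
```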
+
+#### » Adding a new artifact
+
+Adding a new artifact to the benchmark requires several steps:
+
+1. Create a stand-alone directory in `./data/benchmark` and copy all artifact files, including the README file.
+2. Implement oracles for evaluating the AI agent. These should follow the same structure as WASABI's [evaluator](data/benchmark/sosp24_wasabi/wasabi/_agent_eval/), where each oracle is implemented in a separate Python source file and orchestrated by a `main.py` whose `main()` method returns a single integer, the overall score (0..4) the agent achieved (see the sketch after this list).
+3. Create an entry in the [task journal](data/benchmark/arteval_tasks.jsonl) and populate the appropriate fields.
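
A minimal sketch of that `main.py` contract is shown below, assuming each oracle module exposes a `check() -> bool` function. The module names are hypothetical; the WASABI evaluator remains the reference layout:

```python
"""Illustrative oracle orchestrator: run the four stage oracles in order and return a 0..4 score.
Module names are hypothetical; mirror the layout of WASABI's _agent_eval/ in practice."""
import importlib

STAGE_ORACLES = [
    "env_setup",        # 1. environment setup
    "build_install",    # 2. build / install
    "benchmark_prep",   # 3. benchmark preparation
    "experiment_runs",  # 4. experiment execution
]

def main() -> int:
    score = 0
    for name in STAGE_ORACLES:
        oracle = importlib.import_module(name)   # each module is assumed to define check() -> bool
        passed = bool(oracle.check())
        print(f"[{'PASS' if passed else 'FAIL'}] {name}")
        if not passed:
            break   # later stages depend on the earlier ones succeeding
        score += 1
    print(f"Overall score: {score}/4")
    return score

if __name__ == "__main__":
    main()
```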
+
+## Benchmark Setup

-```sh
-docker run -it --rm your_benchmark_image
-# docker run --rm your_benchmark_image
-```
+#### » Install dependencies

-3. Inside the container, navigate to the appropriate directory and execute the benchmark script to start the testing process.
+To install the benchmark, simply run the `install.sh` script to set up the environment:
+```sh
+./install.sh
+```

-```sh
-./run.sh
-```
+This operation will:
+* Set up a Python 3.12 virtual environment
+* Clone and install SWE-agent
+* Install required Python packages (pytest, pytest-cov)
+* Clone course repositories (6.5840-golabs-2024, xv6-labs-2024, etc.)

-### Maunaly Test
+#### » Run the benchmark

-To manually test your benchmark, follow these steps:
+To run the benchmark:

-#### Install Dependencies
+1. Execute the `run.sh` script with your model:

-To install and configure your benchmark, follow these steps:
+```sh
+./run.sh <model_name>
+# Example: ./run.sh claude-sonnet-4-5-20250929
+```

-1. Run the `install.sh` script to set up the environment and install necessary dependencies. You can simply execute the following command:
+2. Configure your LLM endpoint in `env.toml` (see the sketch after this list):
+* For Azure/OpenAI models: Set `AZURE_API_KEY`, `AZURE_API_BASE`, `AZURE_API_VERSION`
+* For Anthropic models: Set `ANTHROPIC_API_KEY`
+* For self-hosted models: Configure `OPENAI_API_TYPE` and `OPENAI_BASE_URL`

-```sh
-./install.sh
-```
+3. Results will be saved to `outputs/` with timestamp and model information.
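
The exact schema of `env.toml` is not shown here; assuming it holds flat key/value pairs named after the variables above, a harness could load it along the following lines (illustrative sketch only):

```python
"""Illustrative loader: export flat key/value pairs from env.toml into the process environment.
Assumes a flat schema with keys such as AZURE_API_KEY or ANTHROPIC_API_KEY; adjust to the real file."""
import os
import tomllib   # standard library on Python 3.11+; the benchmark sets up Python 3.12

with open("env.toml", "rb") as f:
    config = tomllib.load(f)

for key, value in config.items():
    if isinstance(value, (str, int, float, bool)):
        os.environ.setdefault(key, str(value))   # do not overwrite values already set in the shell
```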

-#### Run

-To run your benchmark and obtain results for a specific task and model, follow these steps:
+#### » Supported Agents

-1. Review the `run.sh` script to understand the expected commands and parameters.
-2. Execute the `run.sh` script to start the benchmark. The script will guide you through the process and generate the results.
+The benchmark supports multiple AI agents:
+* **Claude Code**: Anthropic's code assistant
+* **Mini SWE Agent**: the compact version of the [SWE-agent](https://github.com/SWE-agent) assistant
+* **OpenHands**: open-source coding agent

-Feel free to adjust the details to better fit your specific scenario and requirements. Let me know if there's anything else you need!
+To add your own agent to the benchmark, see [add_agents.md](add_agents.md).
