
Commit 13c5fa1

Merge pull request #21 from bastoica/main
Improving the "contributor's guide" and simplifying the benchmark's schema
2 parents 535a78f + 6409d2f commit 13c5fa1

12 files changed (+220 additions, -225 deletions)
Lines changed: 26 additions & 6 deletions

```diff
@@ -1,14 +1,34 @@
 FROM ubuntu:24.04
-
-WORKDIR /usr/src
+
+ARG DEBIAN_FRONTEND=noninteractive
+
+USER root
+
+WORKDIR /
 COPY . .
-RUN apt-get update && apt-get install -y \
+
+RUN rm -rf /var/lib/apt/lists/* \
+    && apt-get update -o Acquire::Retries=5 \
+    && apt-get install -y --no-install-recommends \
     build-essential \
     git \
     wget \
     python3-pip \
-    python3-venv
+    python3-venv \
+    pipx \
+    && rm -rf /var/lib/apt/lists/*
+
+# SWE-ReX will always attempt to install its server into your docker container
+# however, this takes a couple of seconds. If we already provide it in the image,
+# this is much faster.
+RUN pipx install swe-rex
+RUN pipx ensurepath
+
+ENV PATH="/root/.local/bin:${PATH}"
+ENV PATH="/usr/local/go/bin:${PATH}"
+
+SHELL ["/bin/bash", "-c"]

 RUN chmod +x install.sh test.sh && ./install.sh
-
-ENTRYPOINT ["./test.sh"]
+
+CMD ["bash"]
```

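The Dockerfile now ends in `CMD ["bash"]` rather than an `ENTRYPOINT`, so the image drops into an interactive shell by default. As a quick way to try it, here is a minimal usage sketch; it assumes the build runs from the directory containing this Dockerfile (the commit does not show the file's path), and the `arteval-bench` tag is purely illustrative:

```sh
# Build the image from the directory that holds the Dockerfile
# (the tag name is illustrative, not part of the commit).
docker build -t arteval-bench .

# The new CMD ["bash"] starts an interactive shell instead of running test.sh.
docker run --rm -it arteval-bench

# The previous behavior (running the test script) can still be requested explicitly.
docker run --rm arteval-bench ./test.sh
```
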
benchmarks/arteval_bench/README.md

Lines changed: 86 additions & 69 deletions

```diff
@@ -1,87 +1,104 @@
 # ArtEvalBench
 
-`ArtEvalBench` is a benchmark for evaluating AI agents that support the Artifact Evaluation (AE) process by auditing research prototypes (artifacts) that accompany research papers, as part of the peer-review process. Artifact evaluation involves reconstructing a reference environment from (partial) specifications, building and configuring complex codebases with often implicit assumptions, preparing datasets and third-party benchmarks whose availability may change over time, orchestrating multi-stage experiments under controlled resource and time budgets, and validating that observed results fall within acceptable tolerance bounds relative to those reported in the paper. Despite the intricacy of the process, we believe AI agents can be trained to support reviewers in evaluating artifacts that accompany research papers by automating most of these stages.
-
-Want to find out more or contribute? Jump to the [contributor's guide](#contributors-guide).
-
-## Goals and Objectives
-
-Artifact evaluation has become a standard component of the peer-review process across a wide range of conferences in Computer Science, especially in Systems and related areas. Despite this progress however, the practical work of provisioning operational environments, resolving dependencies, building artifacts, preparing benchmarks, running experiments, and checking results remains brittle and time-consuming. To alleviate this burden, we envision an automated artifact evaluation AI assistant that executes repeatable steps under (human) reviewer supervision. This "AE assistant" would target artifact mechanics (e.g., code compilation, dataset/benchmark preparation, experiment orchestration, and output validation) alongside code auditing (e.g., does the artifact implementation match the paper prose? are results closely matching those in the paper?). The agent's output can then inform more a complex methodological assessment, design trade-off analysis, and results interpretation that reviewers need to perform to complete the AE process.
-
-Concretely, given an artifact (code, documentation, experiment framework), a complete installation & operation guide, and the paper itself, the AE assistant:
-
-1. provisions the reference environment;
-
-2. builds/installs a particular version of the artifact using the specified toolchain;
-
-3. retrieves and prepares datasets or other third-party targets;
-
-4. orchestrates experiments with explicit configuration, time and resource budgets; and
-
-5. generates a human-readable report that summarizes the outcome of each step, indicating any blockers (e.g., install missing dependencies) and how it managed to overcome them.
-
-The goal is to reduce reviewer effort on mechanical tasks so attention can shift to scientific auditing.
-
-## Background
-
-#### » The artifact evaluation process
-
-Most conferences award badges to incentivize high-quality artifacts that support the paper's claims by asking authors to participate in a multi-stage evaluation process where reviewers attempt to download, install, and operate the artifacts themselves. The following summarizes the widely used criteria for each badge:
-
-* Artifact Available. This badge indicates that the artifact itself (code, documentation, scripts, benchmarks, etc.) is publicly accessible with a persistent identifier (e.g., DOI, commit ID) on an (ideally, long-term) archival repository (e.g., Zenodo, Github). Availability does not imply the artifact can compile, build, or is functionally correct. It only confirms that the materials needed to verify key claims, reproduce experimental results, and reuse the tool itself are open-sourced.
-
-* Artifact Functional. This badge indicates that the artifact installs/builds in a reference environment and runs at least a subset of the documented experiments. It confirms that dependencies and configurations are explicitly recorded, and outputs, at least for said subset of experiments, are consistent with the paper's prose.
-
-* Results Reproduced. This badge indicates that a third party can re-execute all necessary experiments to obtain results consistent with the paper, with a reasonable degree of tolerance (e.g., within relative error bounds, confidence intervals, or rank-ordering equivalence). On top of re-obtaining results that support the paper's claims, reproducibility further requires verifiable provenance (e.g., SW/HW environment characteristics, configuration parameters, experiment logs) and principled handling of non-determinism (e.g., repeated trials, fixed initial states, or variance analysis).
-
-Further reading and a detailed description of criteria for each badge can be found [here](https://sysartifacts.github.io/eurosys2026/badges) and [here](https://sysartifacts.github.io/evaluator-guide.html).
-
-#### » What makes AE challenging in practice?
-
-Reproducibility and reusability can be obstructed by multiple factors including, but not limited to: (i) environment drift (e.g., legacy libraries no longer available, drivers mismatch in newer OS versions); (ii) undocumented or implicit build assumptions (e.g., hard-coded compiler flags, directory paths, IPs, or reliance on OS-wide libraries that differ across distributions); (iii) brittle preprocessing of third-party benchmarks or datasets (e.g., broken download URL, non-deterministic compilation steps that silently invalidate subsequent stages); and (iv) unspecified results tolerance bounds that complicate validation for non-deterministic experiments (e.g., performance claims without clarifying what constitutes an acceptable deviation when running within a similar SW/HW setup).
-
-Overcoming such challenges require persistence and careful bookkeeping, precisely where an automated AE assistant can provide leverage.
+`ArtEvalBench` is a benchmark for evaluating AI agents against Artifact Evaluation (AE) tasks ([why artifact evaluation?](WHY.md)). We believe that, despite the complexity of the AE process, AI agents can be successfully trained to automatically evaluate artifacts that accompany research papers.
 
 ## Contributor's guide
 
 #### » Overview and high-level structure
 
-To train and improve AE agents in a principled way we introduce `ArtEvalBench`, a curated collection of artifacts accompanying peer-reviewed papers. To ensure a fair comparison we include artifacts that have been already evaluated in an official AE process and awarded all three badges by the committee. Each entry includes the original artifact (instructions, code, scripts, datasets/benchmarks, etc.), the original paper, and a collection of "oracle" scripts that define objective checkpoints at four canonical stages: environment setup, build/install, benchmark preparation, and experiment execution.
+To train and improve AE agents in a principled way, we introduce `ArtEvalBench`, a curated collection of artifacts accompanying peer-reviewed papers. To ensure a fair comparison, we include artifacts that have already been evaluated in an official AE process and awarded all three badges by the committee. Each entry includes the original artifact (instructions, code, scripts, datasets/benchmarks, etc.), the original paper, and a collection of "oracle" scripts that define objective checkpoints at four canonical stages: environment setup, build/install, benchmark preparation, and experiment execution.
 
 `ArtEvalBench` is designed to evaluate agents on capability (which stages they complete), efficiency (wall-clock time and intervention count), and fidelity (how closely reproduced results match those reported).
 
-To check those capabilities, each artifact includes four oracle scripts that encode minimal, verifiable success criteria for each of the four stages. The oracles are invoked non-interactively and must be idempotent. Conceptually, these for stages correspond to:
+To check those capabilities, each artifact includes four oracle scripts that encode minimal, verifiable success criteria for each of the four stages. The oracles are invoked non-interactively and must be idempotent. Conceptually, these four stages correspond to:
 
-1. Environment Setup: verifies presence and versions of required tools, libraries, or other dependencies; confirms hardware availability when applicable; and checks that configurations are portable rather than hardcoded or tied to a specific machine.
-2. Build/Install: confirms a complete build (or install) operation from a specified version, with expected binaries/modules present; running tests, when available, or simple validation commands like invoking `--help` or equivalent.
-3. Benchmark Preparation: asserts that datasets/benchmarks are present and checksums match; verifies that necessary third-party tools compile and the artifact's instrumentation/monitoring hooks are enabled, if applicable.
-4. Experiment Runs: executes each experiment according to the authors' guidelines; checks that the artifact produces the expected metrics, logs, files, figures, etc.; provides an initial assessment relative to specified tolerance bounds.
-
-For a typical example, check out the [agent evaluator](data/benchmark/sosp24_wasabi/wasabi/_agent_eval/) of [WASABI](data/benchmark/sosp24_wasabi/wasabi/).
+1. **Environment setup.** Verifies presence and versions of required tools, libraries, or other dependencies; confirms hardware availability when applicable; and checks that configurations are portable rather than hardcoded or tied to a specific machine.
+2. **Build (and install) the artifact.** Confirms a complete build (or install) operation from a specified version, with expected binaries/modules present; runs tests, when available, or simple validation commands such as invoking `--help` or equivalent.
+3. **Benchmark preparation.** Asserts that datasets/benchmarks are present and checksums match; verifies that necessary third-party tools compile and that the artifact's instrumentation/monitoring hooks are enabled, if applicable.
+4. **Experiment runs.** Executes each experiment according to the authors' guidelines; checks that the artifact produces the expected metrics, logs, files, figures, etc.; provides an initial assessment relative to specified tolerance bounds (a small sketch of such a tolerance check follows this list).
```
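
To make the fourth stage concrete, the fragment below shows the kind of tolerance check such an oracle might perform, comparing a measured metric against a paper-reported value under a relative-error bound. It is a hypothetical sketch, not taken from any artifact in the benchmark: the `results/throughput.txt` path, the reported value, and the 10% bound are all assumptions.

```python
# Hypothetical stage-4 style tolerance check (illustrative only; adapt to your artifact).
from pathlib import Path

REPORTED_OPS_PER_SEC = 1250.0  # value claimed in the paper (assumed)
RELATIVE_TOLERANCE = 0.10      # accept +/- 10% deviation (assumed)

def within_tolerance() -> bool:
    results = Path("results/throughput.txt")  # illustrative output location
    if not results.exists():
        print("Missing results/throughput.txt")
        return False

    measured = float(results.read_text().strip())
    deviation = abs(measured - REPORTED_OPS_PER_SEC) / REPORTED_OPS_PER_SEC
    print(f"measured={measured:.1f}, reported={REPORTED_OPS_PER_SEC:.1f}, "
          f"deviation={deviation:.1%}")
    return deviation <= RELATIVE_TOLERANCE
```
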
```diff
 
 #### » Adding a new artifact
 
-Adding a new artifact to the benchmark requires several steps:
+Adding an artifact to the benchmark requires users to add a new entry to the `ArtEvalBench` [schema file](data/benchmark/arteval_tasks.jsonl) (an example entry is sketched right after this list), where:
+- `artifact_id` is a unique identifier for the artifact;
+- `artifact_dir` is the artifact directory within `data/benchmark/`;
+- `artifact_readme` is the path to the artifact's README file that contains the step-by-step guide for preparing, installing, and running experiments;
+- `artifact_url` is the URL of the original artifact;
+- `evaluator` is the path to the evaluator's `main.py` entrypoint;
+- `expected_score` is the total expected score for this artifact, which defaults to 4 because the agent is evaluated on successfully completing the four canonical AE stages (**Note:** we encourage users not to change this value unless they opt for another universal metric for artifact evaluation);
+- `docker_evn` (optional) points to a Docker image on Docker Hub.
```
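
To make the schema concrete, here is a sketch of a single `arteval_tasks.jsonl` entry (one JSON object per line). All values are illustrative, describing a hypothetical `osdi25_foo` artifact rather than an actual entry in the task journal:

```json
{"artifact_id": "osdi25_foo", "artifact_dir": "data/benchmark/osdi25_foo/foo", "artifact_readme": "data/benchmark/osdi25_foo/foo/README.md", "artifact_url": "https://example.org/osdi25_foo-artifact", "evaluator": "data/benchmark/osdi25_foo/foo/_agent_eval/main.py", "expected_score": 4, "docker_evn": "exampleuser/osdi25-foo:latest"}
```
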
```diff
+
+It also requires users to extend the artifact they plan to add with a self-contained evaluator in an `_agent_eval/` directory. This evaluator encodes *minimal*, objective success criteria for the four canonical AE stages and is what the benchmark actually calls.
+
+Using WASABI's [agent evaluator](data/benchmark/sosp24_wasabi/wasabi/_agent_eval/) as a template, users will therefore need to extend the artifact with:
+
+1. An `_agent_eval/` package, which contains all benchmark-specific code and does *not* modify your original artifact logic.
+
+2. One oracle module per stage, implemented in four distinct Python files, each checking one of the four canonical stages of artifact evaluation. A typical oracle module looks as follows (simplified):
+```python
+# _agent_eval/env_setup.py
+import subprocess
+from pathlib import Path
+
+def check() -> bool:
+    # Example: verify virtualenv exists
+    if not Path("venv").exists():
+        print("Missing venv/ directory")
+        return False
+
+    # Example: verify Python version inside the venv
+    proc = subprocess.run(
+        ["venv/bin/python", "--version"],
+        capture_output=True,
+        text=True,
+    )
+    print(proc.stdout.strip())
+    return proc.returncode == 0 and proc.stdout.startswith("Python 3.10")
+```
+Also, note that each oracle should be:
+- Non-interactive, meaning it does not expect input or prompt interactions.
+- Idempotent, meaning it is safe to run multiple times without side effects.
+- Self-reporting, meaning it returns `True` or `False` based on the validation outcome and prints a brief diagnostic message.
+
+3. A single `main.py` orchestrator, the entrypoint used by ArtEvalBench, which invokes the four oracle modules in order and returns an overall score (an integer between 0 and 4):
+```python
+# _agent_eval/main.py
+from . import env_setup, build_install, prep_benchmark, run_experiments
+
+def main() -> int:
+    score = 0
+    stages = [
+        ("env_setup", env_setup.check),
+        ("build_install", build_install.check),
+        ("prep_benchmark", prep_benchmark.check),
+        ("run_experiments", run_experiments.check),
+    ]
+
+    for name, check in stages:
+        try:
+            ok = bool(check())
+        except Exception as e:
+            print(f"[{name}] FAILED with exception: {e}")
+            ok = False
+
+        if ok:
+            print(f"[{name}] PASSED")
+            score += 1
+        else:
+            print(f"[{name}] FAILED")
+
+    print(f"FINAL_SCORE {score}/4")
+    return score
+
+if __name__ == "__main__":
+    raise SystemExit(main())
+```
+
+Note that the `ArtEvalBench` framework will invoke `main.py` to run the oracles in order, compute the agent's score for this particular artifact, and store it in a JSON file that aggregates these outcomes across the entire benchmark.
```
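
When developing a new evaluator, it can help to run it by hand before wiring it into the benchmark. A minimal sanity-check sketch, using the WASABI template path referenced above; it assumes `_agent_eval/` is importable as a package (i.e., it contains an `__init__.py`), which this guide does not state explicitly:

```sh
# Run the oracle orchestrator directly from the artifact's root directory.
cd data/benchmark/sosp24_wasabi/wasabi
python3 -m _agent_eval.main

# Because main.py exits via `raise SystemExit(main())`, the process exit code
# equals the final score (0-4).
echo "score: $?"
```
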
```diff
 
-1. Create a stand-alone directory in `./data/benchmark` and copying all artifact files including the README file.
-2. Implement oracles for evaluating the AI agent. This feature should follow the same structure as Wasabi's [evaluator](data/benchmark/sosp24_wasabi/wasabi/_agent_eval/), where each oracle is implemented in a separate Python source file and orchestrated by a `main.py` whose `main()` method returns a single integer, the overal score (0..4) the agent achieved.
-3. Create an entry into the [task journal](data/benchmark/arteval_tasks.jsonl) and populate the appropriate fields.
 
 ## Benchmark Setup
 
-#### » Install dependencies
-
-To install the benchmark, simply run the `install.sh` script to set up the environment:
-```sh
-./install.sh
-```
-
-This operaiton will:
-* Install Python 3.12 virtual environment
-* Clone and install SWE-agent
-* Install required Python packages (pytest, pytest-cov)
-* Clone course repositories (6.5840-golabs-2024, xv6-labs-2024, etc.)
-
 #### » Run the benchmark
 
 To run the benchmark:
@@ -104,8 +121,8 @@ To run the benchmark:
 #### » Supported Agents
 
 The benchmark supports multiple AI agents:
-* **Claude Code**: Anthropic's code assistant
-* **Mini SWE Agent**: The compact version of [SWE-agent](https://github.com/SWE-agent) assistant
-* **OpenHands**: Open-source coding agent
+- **Claude Code**: Anthropic's code assistant
+- **Mini SWE Agent**: The compact version of [SWE-agent](https://github.com/SWE-agent) assistant
+- **OpenHands**: Open-source coding agent
 
 To add your own agent to the benchmark, see [add_agents.md](add_agents.md).
```
