`ArtEvalBench` is a benchmark for evaluating AI agents that support the Artifact Evaluation (AE) process, the part of peer review in which research prototypes (artifacts) accompanying research papers are audited. Artifact evaluation involves reconstructing a reference environment from (partial) specifications, building and configuring complex codebases with often implicit assumptions, preparing datasets and third-party benchmarks whose availability may change over time, orchestrating multi-stage experiments under controlled resource and time budgets, and validating that observed results fall within acceptable tolerance bounds relative to those reported in the paper. Despite the intricacy of the process, we believe AI agents can be trained to support reviewers by automating most of these stages.
Want to find out more or contribute? Read about [why artifact evaluation matters](WHY.md) or jump to the [contributor's guide](#contributors-guide).
## Goals and Objectives
Artifact evaluation has become a standard component of the peer-review process across a wide range of conferences in Computer Science, especially in Systems and related areas. Despite this progress, however, the practical work of provisioning operational environments, resolving dependencies, building artifacts, preparing benchmarks, running experiments, and checking results remains brittle and time-consuming. To alleviate this burden, we envision an automated artifact evaluation AI assistant that executes repeatable steps under (human) reviewer supervision. This "AE assistant" would target artifact mechanics (e.g., code compilation, dataset/benchmark preparation, experiment orchestration, and output validation) alongside code auditing (e.g., does the artifact implementation match the paper's prose? do the results closely match those in the paper?). The agent's output can then inform the more complex methodological assessment, design trade-off analysis, and results interpretation that reviewers need to perform to complete the AE process.
Concretely, given an artifact (code, documentation, experiment framework), a complete installation & operation guide, and the paper itself, the AE assistant:
1. provisions the reference environment;
2. builds/installs a particular version of the artifact using the specified toolchain;
3. retrieves and prepares datasets or other third-party targets;
4. orchestrates experiments with explicit configuration, time and resource budgets; and
5. generates a human-readable report that summarizes the outcome of each step, indicating any blockers it encountered (e.g., missing dependencies) and how it overcame them.
The goal is to reduce reviewer effort on mechanical tasks so attention can shift to scientific auditing.
## Background
#### » The artifact evaluation process
Most conferences incentivize high-quality artifacts that support the paper's claims by awarding badges: authors participate in a multi-stage evaluation process in which reviewers attempt to download, install, and operate the artifacts themselves. The following summarizes the widely used criteria for each badge:
* Artifact Available. This badge indicates that the artifact itself (code, documentation, scripts, benchmarks, etc.) is publicly accessible with a persistent identifier (e.g., DOI, commit ID) on an (ideally, long-term) archival repository (e.g., Zenodo, GitHub). Availability does not imply that the artifact compiles, builds, or is functionally correct. It only confirms that the materials needed to verify key claims, reproduce experimental results, and reuse the tool itself are open-sourced.
* Artifact Functional. This badge indicates that the artifact installs/builds in a reference environment and runs at least a subset of the documented experiments. It confirms that dependencies and configurations are explicitly recorded and that outputs, at least for that subset of experiments, are consistent with the paper's prose.
* Results Reproduced. This badge indicates that a third party can re-execute all necessary experiments to obtain results consistent with the paper, with a reasonable degree of tolerance (e.g., within relative error bounds, confidence intervals, or rank-ordering equivalence). On top of re-obtaining results that support the paper's claims, reproducibility further requires verifiable provenance (e.g., SW/HW environment characteristics, configuration parameters, experiment logs) and principled handling of non-determinism (e.g., repeated trials, fixed initial states, or variance analysis).
Further reading and a detailed description of criteria for each badge can be found [here](https://sysartifacts.github.io/eurosys2026/badges) and [here](https://sysartifacts.github.io/evaluator-guide.html).
#### » What makes AE challenging in practice?
Reproducibility and reusability can be obstructed by multiple factors including, but not limited to: (i) environment drift (e.g., legacy libraries no longer available, driver mismatches in newer OS versions); (ii) undocumented or implicit build assumptions (e.g., hard-coded compiler flags, directory paths, IPs, or reliance on OS-wide libraries that differ across distributions); (iii) brittle preprocessing of third-party benchmarks or datasets (e.g., broken download URLs, non-deterministic compilation steps that silently invalidate subsequent stages); and (iv) unspecified results tolerance bounds that complicate validation for non-deterministic experiments (e.g., performance claims without clarifying what constitutes an acceptable deviation when running within a similar SW/HW setup).
Overcoming such challenges requires persistence and careful bookkeeping, precisely where an automated AE assistant can provide leverage.
## Contributor's guide
#### » Overview and high-level structure
To train and improve AE agents in a principled way, we introduce `ArtEvalBench`, a curated collection of artifacts accompanying peer-reviewed papers. To ensure a fair comparison, we include artifacts that have already been evaluated in an official AE process and awarded all three badges by the committee. Each entry includes the original artifact (instructions, code, scripts, datasets/benchmarks, etc.), the original paper, and a collection of "oracle" scripts that define objective checkpoints at four canonical stages: environment setup, build/install, benchmark preparation, and experiment execution.
`ArtEvalBench` is designed to evaluate agents on capability (which stages they complete), efficiency (wall-clock time and intervention count), and fidelity (how closely reproduced results match those reported).
To check those capabilities, each artifact includes four oracle scripts that encode minimal, verifiable success criteria for each of the four stages. The oracles are invoked non-interactively and must be idempotent. Conceptually, these four stages correspond to:
1. Environment Setup: verifies presence and versions of required tools, libraries, or other dependencies; confirms hardware availability when applicable; and checks that configurations are portable rather than hardcoded or tied to a specific machine.
2. Build/Install: confirms a complete build (or install) operation from a specified version, with expected binaries/modules present; runs tests, when available, or simple validation commands such as invoking `--help` or equivalent.
3. Benchmark Preparation: asserts that datasets/benchmarks are present and checksums match; verifies that necessary third-party tools compile and the artifact's instrumentation/monitoring hooks are enabled, if applicable.
4. Experiment Runs: executes each experiment according to the authors' guidelines; checks that the artifact produces the expected metrics, logs, files, figures, etc.; provides an initial assessment relative to specified tolerance bounds.
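To make the notion of tolerance in stage 4 concrete, a check can be as simple as a relative-error comparison. The following is a minimal sketch; the metric, the reported value, and the 10% bound are purely illustrative and not taken from any specific artifact:

```python
def within_tolerance(measured: float, reported: float, rel_tol: float = 0.10) -> bool:
    """Return True if `measured` deviates from `reported` by at most rel_tol (relative error)."""
    return abs(measured - reported) <= rel_tol * abs(reported)


# Hypothetical example: the paper reports a throughput of 1200 ops/s.
print(within_tolerance(measured=1130.0, reported=1200.0))  # True  (~5.8% deviation)
print(within_tolerance(measured=900.0, reported=1200.0))   # False (25% deviation, flag for review)
```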
For a typical example, check out the [agent evaluator](data/benchmark/sosp24_wasabi/wasabi/_agent_eval/) of [WASABI](data/benchmark/sosp24_wasabi/wasabi/).
#### » Adding a new artifact
Adding a new artifact to the benchmark requires several steps. First, create a stand-alone directory under `data/benchmark/` and copy all artifact files into it, including the README. Then, add a new entry to the `ArtEvalBench` [schema file](data/benchmark/arteval_tasks.jsonl), where:

- `artifact_id` is a unique identifier for the artifact;
- `artifact_dir` is the artifact directory within `data/benchmark/`;
- `artifact_readme` is the path to the artifact's README file that contains the step-by-step guide for preparing, installing, and running experiments;
- `artifact_url` is the URL to the original artifact;
- `evaluator` is the path to the evaluator's `main.py` entrypoint;
- `expected_score` is the total expected score for this artifact, which defaults to 4 because the agent is evaluated on successfully completing the four canonical AE stages (**Note:** we encourage users not to change this value unless they opt for another universal metric for artifact evaluation);
- `docker_env` (optional) points to a Docker image on Docker Hub.
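For illustration, a (pretty-printed) entry for the WASABI artifact might look as follows; in the actual `.jsonl` file each entry sits on a single line, and the URL, README name, and Docker image below are hypothetical placeholders, not verified values:

```json
{
  "artifact_id": "sosp24_wasabi",
  "artifact_dir": "data/benchmark/sosp24_wasabi/wasabi",
  "artifact_readme": "data/benchmark/sosp24_wasabi/wasabi/README.md",
  "artifact_url": "https://github.com/example-org/wasabi",
  "evaluator": "data/benchmark/sosp24_wasabi/wasabi/_agent_eval/main.py",
  "expected_score": 4,
  "docker_env": "example-org/wasabi-ae:latest"
}
```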
It also requires users to extend the artifact they plan to add with a self-contained evaluator in an `_agent_eval/` directory. This evaluator encodes *minimal*, objective success criteria for the four canonical AE stages and is what the benchmark actually calls.
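Concretely, based on the module names used by the `main.py` orchestrator shown later in this section, the `_agent_eval/` package might be laid out as follows (the `__init__.py` is an assumption, needed for the relative imports in `main.py`):

```
_agent_eval/
├── __init__.py
├── main.py             # orchestrator invoked by ArtEvalBench
├── env_setup.py        # stage 1: environment setup oracle
├── build_install.py    # stage 2: build/install oracle
├── prep_benchmark.py   # stage 3: benchmark preparation oracle
└── run_experiments.py  # stage 4: experiment runs oracle
```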
Using WASABI's [agent evaluator](data/benchmark/sosp24_wasabi/wasabi/_agent_eval/) as a template, users will therefore need to extend the artifact with:
1. An `_agent_eval/` package that contains all benchmark-specific code and does *not* modify the original artifact logic.
2. One oracle module per stage, implemented in four distinct Python files, each checking one of the four canonical stages of artifact evaluation. Each oracle must be non-interactive (it expects no input or prompt interactions) and idempotent (safe to run multiple times without side effects), and it must return `True` or `False` based on the validation outcome while printing a brief diagnostic message. A simplified sketch of a typical oracle module is shown below.
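A minimal sketch of such an oracle is given below. The tool names and version check are purely illustrative (they are not taken from the WASABI evaluator); a real oracle should verify whatever the artifact's README actually requires for that stage:

```python
# _agent_eval/env_setup.py -- illustrative sketch of a stage-1 oracle
import shutil
import subprocess


def check() -> bool:
    """Non-interactive, idempotent check that the reference environment is ready."""
    # Hypothetical requirements: a Java toolchain and Maven available on PATH.
    for tool in ("java", "mvn"):
        if shutil.which(tool) is None:
            print(f"[env_setup] missing required tool: {tool}")
            return False

    # Report the detected Java version (java prints it to stderr).
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    version_line = result.stderr.splitlines()[0] if result.stderr else "unknown"
    print(f"[env_setup] found {version_line}")
    return True
```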
3. A single `main.py` orchestrator, the entrypoint used by `ArtEvalBench`, which invokes the four oracle modules, runs them in order, and returns an overall score (an integer between 0 and 4):

```python
# _agent_eval/main.py
from . import env_setup, build_install, prep_benchmark, run_experiments


def main() -> int:
    score = 0
    stages = [
        ("env_setup", env_setup.check),
        ("build_install", build_install.check),
        ("prep_benchmark", prep_benchmark.check),
        ("run_experiments", run_experiments.check),
    ]

    for name, check in stages:
        try:
            ok = bool(check())
        except Exception as e:
            print(f"[{name}] FAILED with exception: {e}")
            ok = False

        if ok:
            print(f"[{name}] PASSED")
            score += 1
        else:
            print(f"[{name}] FAILED")

    print(f"FINAL_SCORE {score}/4")
    return score


if __name__ == "__main__":
    raise SystemExit(main())
```
Note that the `ArtEvalBench` framework will invoke `main.py` to run the oracles in order, compute the agent's score for this particular artifact, and store it into a JSON file that aggregates these outcomes for the entire benchmark.
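Before adding the schema entry, contributors may want to smoke-test their evaluator locally. One way to do so (assuming the package layout above; this is not an official `ArtEvalBench` command) is a small driver script placed next to the `_agent_eval/` directory:

```python
# smoke_test.py -- hypothetical local check, run from the artifact's root directory
from _agent_eval.main import main

if __name__ == "__main__":
    score = main()            # prints per-stage PASSED/FAILED lines and FINAL_SCORE
    print(f"local score: {score}/4")
```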
## Benchmark Setup
#### » Install dependencies
To install the benchmark, simply run the `install.sh` script to set up the environment: