
Commit 8b05a67

refactor: rework the first paragraph and fix minor text rendering issues
1 parent db6a947 commit 8b05a67

File tree

1 file changed: +2, -2 lines


benchmarks/arteval_bench/README.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
# ArtEvalBench

-`ArtEvalBench` is a benchmark for evaluating AI agents that support the Artifact Evaluation (AE) process by auditing research prototypes (artifacts) that accompany research papers, as part of the peer-review process ([why artifact evaluation?](WHY.md)). Artifact evaluation involves reconstructing a reference environment from (partial) specifications, building and configuring complex codebases with often implicit assumptions, preparing datasets and third-party benchmarks whose availability may change over time, orchestrating multi-stage experiments under controlled resource and time budgets, and validating that observed results fall within acceptable tolerance bounds relative to those reported in the paper. Despite the intricacy of the process, we believe AI agents can be trained to support reviewers in evaluating artifacts that accompany research papers by automating most of these stages.
+`ArtEvalBench` is a benchmark for evaluating AI agents against Artifact Evaluation (AE) tasks ([why artifact evaluation?](WHY.md)). We believe that, despite the complexity of the AE process, AI agents can be successfully trained to automatically evaluate artifacts that accompany research papers.

## Contributor's guide

@@ -25,7 +25,7 @@ Adding to the benchmark requires users to include a new entry into `ArtEvalBench
- `artifact_readme` is the path to the artifact's README file that contains the step-by-step guide for preparing, installing, and running experiments;
- `artifact_url` is the URL to the original artifact;
- `evaluator` is the path to the evaluator's `main.py` entrypoint;
-- `expected_score` is the total expected score for this artifact, which defaults to 4 as the agent is evaluated on it succesfully completing the four canonical AE stages ([!NOTE] Users are encouraged not to change this value, unless they opt for another universal metric for artifact evaluation).
+- `expected_score` is the total expected score for this artifact, which defaults to 4 as the agent is evaluated on successfully completing the four canonical AE stages (!!NOTE!! We encourage users not to change this value, unless they opt for another universal metric for artifact evaluation).
- `docker_evn` (optional) points to a Docker image on Docker Hub.

It also requires users to extend the artifact they plan to add with a self-contained evaluator in an `_agent_eval/` directory. This evaluator encodes *minimal*, objective success criteria for the four canonical AE stages and is what the benchmark actually calls.
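To make the contributor's guide above concrete, here is a minimal, hypothetical sketch of what a benchmark entry and its `_agent_eval/main.py` evaluator could look like. The excerpt does not specify the entry file format or the evaluator's interface, so the dict layout, the paths, the stage names, and the marker-file convention below are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch only: field names follow the README excerpt, but the
# container format (a plain dict), the example paths, and the evaluator logic
# are assumptions made for illustration.

from pathlib import Path

# A benchmark entry with the fields described in the contributor's guide.
entry = {
    "artifact_readme": "artifacts/example_artifact/README.md",
    "artifact_url": "https://example.com/artifact.tar.gz",
    "evaluator": "artifacts/example_artifact/_agent_eval/main.py",
    "expected_score": 4,  # one point per canonical AE stage
    "docker_evn": "example/artifact-image:latest",  # optional; key spelled as in the README
}

# Minimal _agent_eval/main.py-style evaluator: award one point per AE stage
# whose (assumed) success marker file exists, so a fully evaluated artifact
# scores 4, matching the default expected_score.
STAGES = ("prepare", "install", "run", "validate")

def evaluate(workdir: Path) -> int:
    """Return the number of AE stages whose marker file exists under workdir."""
    return sum((workdir / f"{stage}.ok").exists() for stage in STAGES)

if __name__ == "__main__":
    score = evaluate(Path("."))
    print(f"score: {score} / {entry['expected_score']}")
```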
