Conversation

@bastoica
Collaborator
Description

This PR adds the initial version of ArtEvalBench, a curated collection of artifacts accompanying peer-reviewed papers. The goal of ArtEvalBench is to evaluate whether AI coding agents can set up and build each artifact, prepare and run its experiments, and reproduce the authors' results and main claims.

Changes

  • Added one example artifact
  • Defined the benchmark structure, task format, and evaluation method/metric
  • Adapted wrapper code that drives a coding agent and scores its performance against artifact-specific validation oracles (see the sketch after this list)
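
As a rough illustration of how a task and its oracle-based scoring might fit together — a minimal sketch under assumed names (`TaskSpec`, `score_artifact`, and `oracle_cmds` are hypothetical and not necessarily the identifiers used in this PR):

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """One benchmark task: an artifact plus instructions and validation oracles."""
    artifact_id: str       # e.g., "paper-123-artifact" (placeholder)
    repo_url: str          # where the artifact's code lives
    instructions: str      # what the agent is asked to set up, build, and run
    oracle_cmds: list[str] = field(default_factory=list)  # artifact-specific checks


def score_artifact(task: TaskSpec, run_oracle) -> float:
    """Score one agent run as the fraction of validation oracles that pass.

    `run_oracle` is a callable that executes a single oracle command inside
    the agent's workspace and returns True on success.
    """
    if not task.oracle_cmds:
        return 0.0
    passed = sum(1 for cmd in task.oracle_cmds if run_oracle(cmd))
    return passed / len(task.oracle_cmds)
```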

Testing

Tested by manually running the agent framework against the example artifact.
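
For reference, a manual run against the example artifact could look roughly like the snippet below, reusing the sketch above; the task fields and the oracle command are placeholders, not the PR's actual values:

```python
import subprocess

# Hypothetical manual test: execute the oracle commands in a subprocess
# and report the pass rate. Real runs would go through the agent wrapper.
task = TaskSpec(
    artifact_id="example-artifact",
    repo_url="https://github.com/example/artifact",
    instructions="Set up and build the artifact, then reproduce its main result.",
    oracle_cmds=["python validate.py"],
)
rate = score_artifact(
    task,
    run_oracle=lambda cmd: subprocess.run(cmd, shell=True).returncode == 0,
)
print(f"oracle pass rate: {rate:.0%}")
```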

Checklist

  • Tests pass locally
  • Code follows project style guidelines
  • Documentation updated (if needed)

@bastoica bastoica marked this pull request as ready for review November 19, 2025 09:12
@xuafeng xuafeng merged commit 535a78f into main Nov 19, 2025
2 checks passed
@xuafeng xuafeng deleted the arteval_benchmark branch November 19, 2025 17:02
Couen pushed a commit to Couen/system-intelligence-benchmark that referenced this pull request Jan 22, 2026
tareknaser pushed a commit that referenced this pull request Feb 5, 2026