Conversation

@bastoica
Collaborator
Description

This PR adds the initial version of ArtEvalBench, a curated collection of artifacts accompanying peer-reviewed papers. The goal of ArtEvalBench is to evaluate whether AI coding agents can set up and build each artifact, prepare and run its experiments, and reproduce the authors' results and main claims.

Changes

  • Added one example artifact
  • Defined the benchmark structure, task format, and evaluation method/metric
  • Adapted wrapper code that drives a coding agent and scores its performance against artifact-specific validation oracles (see the sketch after this list)
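
As a rough illustration of how a task and its oracle-based scoring might fit together — a minimal sketch under assumed names (`TaskSpec`, `score_artifact`, and `oracle_cmds` are hypothetical and not necessarily the identifiers used in this PR):

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """One benchmark task: an artifact plus instructions and validation oracles."""
    artifact_id: str       # e.g., "paper-123-artifact" (placeholder)
    repo_url: str          # where the artifact's code lives
    instructions: str      # what the agent is asked to set up, build, and run
    oracle_cmds: list[str] = field(default_factory=list)  # artifact-specific checks


def score_artifact(task: TaskSpec, run_oracle) -> float:
    """Score one agent run as the fraction of validation oracles that pass.

    `run_oracle` is a callable that executes a single oracle command inside
    the agent's workspace and returns True on success.
    """
    if not task.oracle_cmds:
        return 0.0
    passed = sum(1 for cmd in task.oracle_cmds if run_oracle(cmd))
    return passed / len(task.oracle_cmds)
```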

Testing

Tested by manually running the agent framework against the example artifact.
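
For reference, a manual run against the example artifact could look roughly like the snippet below, reusing the sketch above; the task fields and the oracle command are placeholders, not the PR's actual values:

```python
import subprocess

# Hypothetical manual test: execute the oracle commands in a subprocess
# and report the pass rate. Real runs would go through the agent wrapper.
task = TaskSpec(
    artifact_id="example-artifact",
    repo_url="https://github.com/example/artifact",
    instructions="Set up and build the artifact, then reproduce its main result.",
    oracle_cmds=["python validate.py"],
)
rate = score_artifact(
    task,
    run_oracle=lambda cmd: subprocess.run(cmd, shell=True).returncode == 0,
)
print(f"oracle pass rate: {rate:.0%}")
```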

Checklist

  • Tests pass locally
  • Code follows project style guidelines
  • Documentation updated (if needed)

@bastoica bastoica marked this pull request as ready for review November 19, 2025 09:12
@xuafeng xuafeng merged commit 535a78f into main Nov 19, 2025
2 checks passed
@xuafeng xuafeng deleted the arteval_benchmark branch November 19, 2025 17:02
Couen pushed a commit to Couen/system-intelligence-benchmark that referenced this pull request Jan 22, 2026
tareknaser pushed a commit that referenced this pull request Feb 5, 2026