This document is targeted at Ai2 internal devs who are creating new evaluations or agents for AstaBench.
Private Ai2 submission flow (sample); you'll also need an HF_TOKEN for an account that is a member of the Ai2 Hugging Face org:
# Step 1. Publish to submissions repo.
# (--openness ow = open-weight; --tool-usage s = standard)
astabench publish logs/2025-01-15-myagent-run/ \
  --agent-name "MyAgent" \
  --agent-description "v1.0" \
  --openness ow \
  --tool-usage s \
  --submissions-repo-id allenai/asta-bench-internal-submissions
# Output: hf://allenai/asta-bench-internal-submissions/1.0.0-dev1/validation/username_MyAgent_2025-01-01T00-00-00
# This step generates a submission.json file containing your agent name, description, and other details, and uploads it along with your logs and config to the submissions repo.
# Step 2. Score the remote submission (use path from step 1).
astabench score hf://allenai/asta-bench-internal-submissions/1.0.0-dev1/validation/username_MyAgent_2025-01-01T00-00-00
# This writes scores.json and summary_stats.json to a summaries directory in the submissions repo; scores.json must be in this location to publish results to the leaderboard.
# Step 3. Publish data to display in the leaderboard UI.
astabench lb publish hf://allenai/asta-bench-internal-submissions/1.0.0-dev1/validation/username_MyAgent_2025-01-01T00-00-00
# This step expects eval_config.json and submission.json to be available in your submission files and scores.json to be present in the submission summaries directory. It integrates these files for publication to the leaderboard.
Note to Ai2 internal users: Steps 2 and 3 may be run centrally, e.g. to re-score with up-to-date costs and make controlled updates to leaderboard display data.
To publish already-scored results in bulk, you can run:
cat filenames_to_publish.txt | xargs astabench lb publish
InspectAI has extensive documentation on how to implement task code; this section covers additional considerations when adding a task to AstaBench to ensure compatibility with suite expectations.
To add a new task to AstaBench, you should implement it as an InspectAI task. In general, the new task should go in a new directory in astabench/evals, unless it is a direct extension or variation of an existing task (e.g., a new set of annotated data with the same format/scoring as the original task).
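For orientation, a minimal InspectAI task might look like the following sketch (the task name and file path are hypothetical; see the demo evals referenced below for real examples):

```python
# e.g. astabench/evals/my_task/task.py (hypothetical)
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact

@task
def my_task() -> Task:
    return Task(
        dataset=[
            Sample(
                # The chat input carries all instructions, including answer formatting.
                input="What is 2 + 2? Reply with only the number.",
                target="4",
            )
        ],
        scorer=exact(),
    )
```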
After creating the task, be sure to register it in astabench/evals/_registry.py so that it can be discovered by InspectAI as astabench/[YOUR_TASK_NAME]; this should be done for all tasks, regardless of whether they will be part of the official set of AstaBench tasks run via astabench eval.
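Registration typically amounts to importing the task function in the registry module so Inspect can discover it; a hedged sketch (the exact contents of _registry.py may differ):

```python
# astabench/evals/_registry.py (sketch; the real file lists all registered tasks)
from astabench.evals.my_task.task import my_task  # noqa: F401
```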
Tasks in the official astabench eval set will additionally need to be added to the latest astabench/config/v*.yml file; this requires establishing a single "primary" metric for the task, and creating separate task definitions for validation and test splits (necessary because there is no way to pass task-arguments to the underlying inspect eval-set).
For an example of a custom eval, see astabench/evals/demo/arithmetic in this repo or other InspectAI evals in inspect_evals.
The directory includes examples of a simple eval (astabench/evals/demo/arithmetic/task.py), an eval that checks tool calls (astabench/evals/demo/arithmetic/task_with_tools.py), and an eval that checks a stylistic rubric (astabench/evals/demo/arithmetic/task_with_rubric.py).
For more complex tasks that require a custom scorer, check the Inspect documentation.
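As a rough illustration of the shape of a custom scorer (a hypothetical example; consult the Inspect documentation for the authoritative API):

```python
from inspect_ai.scorer import Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy()])
def contains_target():
    async def score(state: TaskState, target: Target) -> Score:
        # Judge the final completion against the sample's target text.
        answer = state.output.completion
        correct = target.text.strip().lower() in answer.lower()
        return Score(value="C" if correct else "I", answer=answer)

    return score
```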
To ensure that general agents can run on the tasks, task samples should always be solvable based on chat input + tools, where a "chat" is text/image/audio that can be fed directly to a (possibly multimodal) LLM and "tools" are inspect_ai.Tools passed via state.tools. That is:
- the user input chat should contain all necessary task instructions/answer formatting instructions
  - corollary for solver design: general solvers should not have task-specific instructions or parameters, though they may have generic problem-solving instructions (e.g. "think step-by-step")
- it should be possible to achieve a perfect score by reading the input, performing some actions with state.tools and observing results, and setting state.output.completion
  - it is ideal if the task can be fully judged based on state.output.completion; this is not always feasible, so some tasks may judge intermediate states of tool calls or sandbox files if they provide the necessary tools for manipulating those
    - e.g. see how super scores code trajectory
  - a solver may internally let the model use some kind of submit() tool to submit the answer; that tool can set the state.output.completion chat to the answer
- If the task depends on some large structured data source (e.g. a database), the task should provide a tool (e.g. via use_tools(...) in the task setup= parameter) that the agent can use to access the resource; see the sketch after this list
- it should not be necessary for solvers to read/write task-specific structured data like state.choices/state.store/state.metadata
  - E.g., a task should not expect to read the answer from state.store["answer"], unless it provides the agent with a tool that can set that field
- the input/output chats may be structured (e.g. JSON) as long as they are consumable by a chat LLM
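To make these requirements concrete, here is a minimal sketch (the task name, tool, and data are hypothetical) of a task that exposes its data source as a tool via use_tools(...) in setup= and is judged purely on state.output.completion:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import use_tools
from inspect_ai.tool import tool

# Hypothetical in-memory stand-in for a large structured data source.
FAKE_CORPUS = {"Attention Is All You Need": "We propose the Transformer, ..."}

@tool
def lookup_paper():
    async def execute(title: str) -> str:
        """Look up a paper's abstract by title.

        Args:
            title: Exact title of the paper to look up.
        """
        return FAKE_CORPUS.get(title, "not found")

    return execute

@task
def paper_lookup_demo() -> Task:
    return Task(
        dataset=[
            Sample(
                # The chat input carries all instructions and formatting requirements.
                input=(
                    "Use the lookup_paper tool to find the abstract of "
                    "'Attention Is All You Need'. What architecture does it "
                    "propose? Answer with a single word."
                ),
                target="Transformer",
            )
        ],
        # setup= makes the tool available to any solver via state.tools.
        setup=use_tools(lookup_paper()),
        # Judged purely on state.output.completion.
        scorer=includes(),
    )
```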
For multiple choice tasks, there is a format_multichoice_as_text utility that can be used to convert Inspect's sample.choices into a pure text format as required (used by e.g. litqa2).
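As a rough illustration of the kind of conversion such a utility performs (this is not the actual astabench implementation; choices_to_text is a hypothetical name):

```python
from inspect_ai.dataset import Sample

def choices_to_text(sample: Sample) -> str:
    # Fold sample.choices into the chat input as lettered options so a plain
    # chat LLM can answer without any multiple-choice-specific machinery.
    letters = "ABCDEFGHIJ"
    options = "\n".join(
        f"{letters[i]}. {choice}" for i, choice in enumerate(sample.choices or [])
    )
    return f"{sample.input}\n\n{options}\n\nAnswer with the letter of your choice."
```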