Rescore submissions by stefanc-ai2 · Pull Request #147 · allenai/asta-bench

stefanc-ai2 · 2026-04-23T17:00:43Z

Code for https://github.com/allenai/nora-issues/issues/2760

Script to rescore plus documentation.

mdarcy220 · 2026-04-28T20:17:37Z

+set -euo pipefail
+
+# Edit this regex to change which submission logs are targeted for rescoring.
+TARGET_LOG_REGEX='(sqa-(test|dev)|e2e-discovery(-hard)?-(test|validation))'


Two snags:

This regex doesn't pick up the sqa-specific solvers since those submitted eval files are renamed; the more reliable way (used by `astabench score`) is to look inside the files to check the stored task name.

This only captures SQA and E2E tasks because those are the scorers that changed in this round of model deprecations, but they are not the only two tasks that use LLM judges.

IMO the ideal fix would be to change the detection method and make it more configurable for future rescores. Alternatively, I think it would be okay to start it with `echo "This script is a template and needs to be adapted to your use case. Edit and try again"; exit 1` so that it's obvious to future users that it needs the per-use-case adaptation (maybe also include a comment about the regex detection caveat).
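For illustration, a rough sketch of what content-based detection could look like. This is not how `astabench score` does it; it assumes each submitted log is an Inspect `.eval` zip archive whose `header.json` stores the task name under `eval.task`, and the helper names (`stored_task_name`, `task_is_target`, `log_is_target`) and the `TARGET_TASK_REGEX` override are hypothetical.

```shell
#!/bin/sh
# Sketch only: select logs by the task name stored inside each file,
# so renamed eval files are still picked up.

# Default pattern copied from TARGET_LOG_REGEX in the diff. Stored task names
# may be spelled differently from filenames (assumption), so adapt as needed
# by exporting TARGET_TASK_REGEX before running.
DEFAULT_TASK_REGEX='(sqa-(test|dev)|e2e-discovery(-hard)?-(test|validation))'
TARGET_TASK_REGEX="${TARGET_TASK_REGEX:-$DEFAULT_TASK_REGEX}"

# Print the task name stored in a log file.
# ASSUMPTION: the log is a zip archive with header.json -> eval.task.
stored_task_name() {
  python3 - "$1" <<'PY'
import json
import sys
import zipfile

with zipfile.ZipFile(sys.argv[1]) as archive:
    with archive.open("header.json") as fh:
        print(json.load(fh)["eval"]["task"])
PY
}

# Succeed if a task name matches the configured pattern.
task_is_target() {
  printf '%s\n' "$1" | grep -Eq "$TARGET_TASK_REGEX"
}

# Succeed if a log file should be rescored, regardless of its filename.
log_is_target() {
  task_is_target "$(stored_task_name "$1")"
}
```

A caller could then loop over candidate files with something like `for f in logs/*.eval; do log_is_target "$f" && echo "$f"; done`, and widen the selection to other LLM-judged tasks by changing `TARGET_TASK_REGEX` alone.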

mdarcy220 · 2026-04-28T20:20:49Z


 The scorer project is pinned independently in [`solvers/scorer/pyproject.toml`](solvers/scorer/pyproject.toml).

+### Rescoring Existing Submissions


Probably move to INTERNAL

mdarcy220 · 2026-04-28T20:23:36Z

 dependencies = [
    "inspect_ai==0.3.203",
-    "agent-eval[leaderboard]==0.1.47",
+    "agent-eval[leaderboard]==0.1.49",


Suggested change

-    "agent-eval[leaderboard]==0.1.49",
+    "agent-eval[leaderboard]==0.1.50",

stefanc-ai2 added 8 commits April 20, 2026 10:33

rescore

6de23f2

progress script

7f0fca4

score delta

5dbba04

better rescore

7a95ea0

resume mode

f45c29b

fix

acbba81

add readme

bec3b2f

normalize

3ab6468

stefanc-ai2 self-assigned this Apr 23, 2026

stefanc-ai2 added 3 commits April 23, 2026 10:52

format

066a38c

up agent eval

198944b

scorer uv lock

7be6006

stefanc-ai2 requested a review from mdarcy220 April 28, 2026 19:19

mdarcy220 reviewed Apr 28, 2026

stefanc-ai2 wants to merge 11 commits into main from rescore-submissions
