Rescore submissions #147

Open
stefanc-ai2 wants to merge 11 commits into main from rescore-submissions

Conversation

stefanc-ai2 commented Apr 23, 2026

Code for https://github.com/allenai/nora-issues/issues/2760

Adds a script to rescore existing submissions, plus documentation.

stefanc-ai2 self-assigned this Apr 23, 2026
stefanc-ai2 requested a review from mdarcy220 April 28, 2026 19:19

set -euo pipefail

# Edit this regex to change which submission logs are targeted for rescoring.
TARGET_LOG_REGEX='(sqa-(test|dev)|e2e-discovery(-hard)?-(test|validation))'

mdarcy220 commented Apr 28, 2026

Two snags:

  • This regex doesn't pick up the SQA-specific solvers, because those submitted eval files are renamed. The more reliable way (used by `astabench score`) is to look inside the files and check the stored task name.
  • This only captures the SQA and E2E tasks because those are the scorers that changed in this round of model deprecations, but they are not the only two tasks that use LLM judges.

IMO the ideal fix would be to change the detection method and make it more configurable for future rescores. Alternatively, I think it would be okay to start the script with `echo "This script is a template and needs to be adapted to your use case. Edit and try again"; exit 1` so that it's obvious to future users that it needs per-use-case adaptation (maybe also include a comment about the regex detection caveat).
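
A minimal sketch of the content-based detection suggested above, written as shell functions that could stand in for the filename regex. Everything named here is illustrative rather than taken from the PR: `TARGET_TASK_REGEX`, `stored_task_name`, `task_is_targeted`, and `list_targeted_logs` are hypothetical names, the default pattern is just the filename regex carried over as a placeholder (the task names actually stored in the logs may differ), and the sketch assumes that `inspect log dump --header-only` prints a JSON header with the task name at `eval.task`. It is not a copy of how `astabench score` performs the check.

```shell
# Hypothetical knob: override from the environment instead of editing the script.
# The default is the filename pattern from the PR, used here only as a placeholder.
TARGET_TASK_REGEX="${TARGET_TASK_REGEX:-(sqa-(test|dev)|e2e-discovery(-hard)?-(test|validation))}"

# Reads a JSON log header on stdin and prints the task name stored in it.
# Assumption: the name lives at .eval.task in the header.
stored_task_name() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["eval"]["task"])'
}

# Succeeds if the task name in $1 matches TARGET_TASK_REGEX (extended regex).
task_is_targeted() {
  printf '%s\n' "$1" | grep -Eq -- "$TARGET_TASK_REGEX"
}

# Prints each *.eval log directly under directory $1 whose stored task is targeted.
# Assumption: `inspect log dump --header-only` emits the log header as JSON.
list_targeted_logs() {
  for log in "$1"/*.eval; do
    [ -f "$log" ] || continue
    # Skip logs whose header cannot be read or parsed.
    task=$(inspect log dump --header-only "$log" | stored_task_name) || continue
    if task_is_targeted "$task"; then
      printf '%s\n' "$log"
    fi
  done
}
```

With this shape, a renamed submission file is still selected as long as its header records a matching task, and a later rescore can widen or narrow the set by setting `TARGET_TASK_REGEX` in the environment before running the script.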

Comment thread README.md

The scorer project is pinned independently in [`solvers/scorer/pyproject.toml`](solvers/scorer/pyproject.toml).

### Rescoring Existing Submissions

Probably move to INTERNAL

Comment thread pyproject.toml
dependencies = [
"inspect_ai==0.3.203",
"agent-eval[leaderboard]==0.1.47",
"agent-eval[leaderboard]==0.1.49",

Suggested change
"agent-eval[leaderboard]==0.1.49",
"agent-eval[leaderboard]==0.1.50",

2 participants