AgingBench Leaderboard

We accept AgingCard JSON submissions via GitHub, one card per (model × scenario × seed). A card is accepted if it validates against the schema and its run is reproducible. Acceptance implies no commercial endorsement, and the leaderboard publishes no single-scalar ranking.

Tracks

| Track | What you submit | Anchor |
| --- | --- | --- |
| Tier 2: Autonomous agent (S7) | Agent that owns its session loop (Claude Code, OpenHands, custom) | Default surface; recall is the headline metric |
| Tier 1 · A: Model swap | LLM; `ReferenceAgent` + memory policy + seeds held constant | "Which model ages better under identical scaffolding?" |
| Tier 1 · B: Memory policy | `MemoryPolicy` subclass; model held at Haiku-4.5 baseline | Memory-systems research + retrieval/compression studies |
| Tier 1 · C: Controller | `ThresholdController` subclass | Opens in v1.1 |

Submitting

Generate a card:

cd prototype
uv run agingbench run \
  --scenario <s1_…|s7_research_notes> \
  --sut <your-sut.yaml> \
  --generated --sessions 10 --seeds 3 --card

Validate against the schema:

python -m agingbench.metrics.aging_card_validate \
  experiments/results/<run-dir>/aging_card.json
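
If a run leaves more than one card under experiments/results/, the validator can be looped over all of them before opening an issue. The following is a minimal sketch, not part of AgingBench: `validate_cards` is a hypothetical helper, and it assumes that the validator module exits with status 0 on success and non-zero on failure, which is not confirmed here.

```python
import subprocess
import sys
from pathlib import Path


def validate_cards(results_root, run=subprocess.run):
    """Run the schema validator on every aging_card.json under results_root.

    Returns a dict mapping each card path to True or False.
    `run` is injectable so the loop can be exercised without AgingBench installed.
    """
    outcomes = {}
    for card in sorted(Path(results_root).rglob("aging_card.json")):
        # Same module invocation as the command above, one card at a time.
        cmd = [
            sys.executable,
            "-m",
            "agingbench.metrics.aging_card_validate",
            str(card),
        ]
        completed = run(cmd)
        # Assumption: exit status 0 means the card passed validation.
        outcomes[card] = completed.returncode == 0
    return outcomes
```

Calling `validate_cards("experiments/results")` from the prototype directory returns one entry per card found; any `False` value marks a card to fix before submitting.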

Open a GitHub issue with the AgingCard submission template, attach the card, and note the track (A / B / C / Tier 2).

If your run uses a forked AgingBench (modified scoring, custom scenarios), declare it on submission. Fork cards are tagged and stored separately so that the main leaderboard reflects only canonical runs.

What we commit to

Weekly review of submitted AgingCards.
8-week release cadence for the codebase.
Backward-compatible AgingCard schema evolution across minor versions.
Transparent governance: if a submission turns out to be misrepresented, the card moves to leaderboard/_retracted/ with the dispute reasoning. Nothing is silently deleted.

Disputes

Open a [dispute] issue if a card misrepresents the model. Two leaderboard operators review; if upheld, the card moves to leaderboard/_retracted/ with the dispute reasoning attached.
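
The retraction step described above (move the card, keep the reasoning, delete nothing) could be sketched as follows. This is an illustration only: `retract_card`, the `.dispute.json` sidecar file, and its fields are hypothetical and do not describe the operators' actual tooling. Only the leaderboard/_retracted/ location comes from this document.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def retract_card(card_path, leaderboard_dir, reasoning):
    """Move a card into <leaderboard_dir>/_retracted/ and record why.

    The card is moved, never deleted. The sidecar name
    (<card stem>.dispute.json) and its fields are hypothetical.
    """
    card_path = Path(card_path)
    retracted_dir = Path(leaderboard_dir) / "_retracted"
    retracted_dir.mkdir(parents=True, exist_ok=True)

    target = retracted_dir / card_path.name
    if target.exists():
        # Refuse to overwrite an earlier retraction with the same file name.
        raise FileExistsError(f"{target} already exists")
    shutil.move(str(card_path), str(target))

    # Store the dispute reasoning next to the retracted card.
    note = {
        "card": target.name,
        "reasoning": reasoning,
        "retracted_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = target.with_name(target.stem + ".dispute.json")
    sidecar.write_text(json.dumps(note, indent=2), encoding="utf-8")
    return target
```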

Eligibility

Card must validate against aging_card_schema.json.
Provenance git_sha must match a published AgingBench release (or be a disclosed fork).
Conflict-of-interest disclosure required if the submitter is affiliated with the model provider.
One card per (model × scenario × seed). Newer cards supersede older ones via a superseded_by link rather than replacing them.
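
Two of the rules above lend themselves to a mechanical check: the provenance rule and the supersede-by-link rule. The sketch below is a rough illustration under stated assumptions. Only `git_sha` and `superseded_by` are names taken from this document. The nesting of `git_sha` under a `provenance` object, the `fork_disclosed` flag, and the `card_id`, `model`, `scenario`, `seed`, and `submitted_at` fields are hypothetical stand-ins for whatever aging_card_schema.json actually defines.

```python
from collections import defaultdict


def check_provenance(card, release_shas):
    """Return a list of eligibility problems; an empty list means none found."""
    problems = []
    provenance = card.get("provenance", {})
    sha = provenance.get("git_sha")
    # Hypothetical flag: True when the submitter has declared a fork.
    fork_disclosed = bool(provenance.get("fork_disclosed", False))
    if sha is None:
        problems.append("missing provenance git_sha")
    elif sha not in release_shas and not fork_disclosed:
        problems.append("git_sha matches no published release and no fork is disclosed")
    return problems


def link_superseded(cards):
    """Link older cards to newer ones that share (model, scenario, seed).

    Older cards are kept and pointed at their successor via superseded_by.
    The newest card in each group gets superseded_by = None.
    Assumes submitted_at values sort chronologically (e.g. ISO 8601 in one zone).
    """
    groups = defaultdict(list)
    for card in cards:
        groups[(card["model"], card["scenario"], card["seed"])].append(card)

    for group in groups.values():
        group.sort(key=lambda c: c["submitted_at"])
        for older, newer in zip(group, group[1:]):
            older["superseded_by"] = newer["card_id"]
        group[-1]["superseded_by"] = None
    return cards
```

Cards whose `superseded_by` is `None` after linking are the current entries for their (model × scenario × seed) key; the rest remain in place as history.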
