Draft
71 commits
46c7914
Add Phase 0 leaderboard canonical models and specs
Am1n3e Feb 7, 2026
1dbcc6e
Add leaderboard submission PR CI validator with fail-closed rate limits
Am1n3e Feb 7, 2026
d6b7797
Implement Track A control-plane transitions and validation
Am1n3e Feb 7, 2026
6cbded5
Merge branch 'codex/leaderboard-track-a-control' into add-leadearboar…
Am1n3e Feb 7, 2026
b24cc00
Address Track B review: move validator to dev and harden PR checks
Am1n3e Feb 7, 2026
1a0bd03
Implement Astro leaderboard site with tabbed rankings and styled resp…
Am1n3e Feb 7, 2026
da1fd4f
Refine leaderboard controls alignment and hide page size label
Am1n3e Feb 7, 2026
09a4c2a
Merge branch 'codex/leaderboard-track-b-ci' into add-leadearboard-site
Am1n3e Feb 7, 2026
f99cb8e
Address Track C review: base-aware site paths and schema parity
Am1n3e Feb 7, 2026
3037c70
Merge branch 'codex/leaderboard-track-c-site' into add-leadearboard-site
Am1n3e Feb 7, 2026
0da507b
Implement Track D atomic leaderboard publish workflow and dev tooling
Am1n3e Feb 7, 2026
2974072
Merge branch 'codex/leaderboard-track-d-docs-e2e' into add-leadearboa…
Am1n3e Feb 7, 2026
c882e67
Implement HF PR-only scheduled leaderboard ingestion
Am1n3e Feb 8, 2026
bf7685a
Seed leaderboard integration branch with full implementation baseline
Am1n3e Feb 8, 2026
f944845
Refactor leaderboard CI validation modules (#26)
Am1n3e Mar 7, 2026
ae39d24
Add leaderboard implementation planning docs
Am1n3e Mar 7, 2026
c3008fe
Allow leaderboard manifest URL override via env
Am1n3e Mar 7, 2026
d4fe3aa
Freeze Lane B contracts around integer submission IDs (#31)
Am1n3e Mar 7, 2026
79a6a63
Add read-only PR gate intake validation (#33)
Am1n3e Mar 7, 2026
f1a6970
Implement Lane D finalize workflow with canonical records (#32)
Am1n3e Mar 7, 2026
1e17610
Implement Lane E canonical rebuild single-writer flow (#34)
Am1n3e Mar 7, 2026
c7342eb
Document Lane A governance plan and execution status (#30)
Am1n3e Mar 7, 2026
a757128
Complete Lane F source wiring and Lane G contract coverage (#35)
Am1n3e Mar 7, 2026
6c392d9
Harden leaderboard submission pipeline and remove legacy HF sync path
Am1n3e Mar 7, 2026
a91b8a6
Remove legacy leaderboard workflows and processed publish path
Am1n3e Mar 7, 2026
4904e66
Move leaderboard scripts and inline rebuild into finalize workflow
Am1n3e Mar 7, 2026
d244f35
Harden leaderboard finalize trust boundaries and public row contract
Am1n3e Mar 7, 2026
27d1972
Fix leaderboard site bugs and improve visual polish
Am1n3e Mar 7, 2026
8612ad7
Clean up redundant content and add sticky footer
Am1n3e Mar 7, 2026
6b019b8
Replace legacy leaderboard intake with HF-direct control flow
Am1n3e Mar 7, 2026
da0029a
Remove FAQ page, fix full-width layout, nav highlight, and table alig…
Am1n3e Mar 7, 2026
75a006c
Remove temporary plan docs and apply review cleanups
Am1n3e Mar 7, 2026
a47e114
Remove tracked .wip docs and rename file lock helper
Am1n3e Mar 7, 2026
8dfa83c
Remove leaderboard spec docs from branch
Am1n3e Mar 7, 2026
2a0754a
Separate CI leaderboard schemas from submission types
Am1n3e Mar 7, 2026
51ad00a
Remove redundant leaderboard row field checks
Am1n3e Mar 7, 2026
ec238dd
Remove webhook bridge and unused leaderboard templates
Am1n3e Mar 7, 2026
fea946b
Consolidate leaderboard types into core modules
Am1n3e Mar 7, 2026
c9222b3
Remove dead utility modules from leaderboard scripts
Am1n3e Mar 7, 2026
4b1befc
Simplify leaderboard type validators and payload contracts
Am1n3e Mar 7, 2026
21611ac
Add backward-compat key stripping for submission control records
Am1n3e Mar 7, 2026
b05997a
Simplify publish path and ingest workflow
Am1n3e Mar 7, 2026
ef02ad4
Fix stale-event submission_uid consistency in HF ingest
Am1n3e Mar 7, 2026
fd20c96
Remove stale site lockfile and update AGENTS docs
Am1n3e Mar 7, 2026
9e64a38
Remove tar archive generation from create-submission-pkg
Am1n3e Mar 7, 2026
46c6ea1
Add leaderboard submit CLI and documentation
Am1n3e Mar 8, 2026
425f4c0
Refactor submission handlers for two-step create-then-submit workflow
Am1n3e Mar 8, 2026
0c2e5fa
Update CLI args for two-step submission workflow
Am1n3e Mar 8, 2026
2ccf696
Update submission docs for two-step workflow
Am1n3e Mar 8, 2026
f233b2e
Refactor leaderboard workflows to use script entrypoints
Am1n3e Mar 8, 2026
fa1c86b
Simplify submission packaging metadata and validation
Am1n3e Mar 8, 2026
268fd4c
Refine submission metadata schema and validation
Am1n3e Mar 8, 2026
4e741ab
Harden vNext submission ingest and leaderboard rebuild flow
Am1n3e Mar 8, 2026
638f83f
Fix deterministic rebuilds and coverage-aware evaluation scoring
Am1n3e Mar 8, 2026
865e44e
Remove leaderboard smoke workflow and related code
Am1n3e Mar 8, 2026
94ad740
Remove manual leaderboard rebuild workflow and task wiring
Am1n3e Mar 8, 2026
20cb52b
Simplify leaderboard CI to single ingest workflow
Am1n3e Mar 8, 2026
a30761c
Consolidate leaderboard script modules
Am1n3e Mar 8, 2026
6e765bd
Move leaderboard-only code from client package to leaderboard scripts
Am1n3e Mar 8, 2026
c862036
Add docstrings to all classes and public functions in leaderboard scr…
Am1n3e Mar 8, 2026
30cf9c4
Unify docs and leaderboard deploy into single gh-pages workflow
Am1n3e Mar 8, 2026
ec95265
Add Pydantic settings config for leaderboard scripts
Am1n3e Mar 8, 2026
5d2068d
Add HF API backend facade for submission data access
Am1n3e Mar 8, 2026
1c0f058
Refactor leaderboard builder with deduplication and logging
Am1n3e Mar 8, 2026
3541682
Simplify ingest pipeline to direct function calls
Am1n3e Mar 8, 2026
0a4bea9
Update rebuild tests to use LeaderboardBuilder directly
Am1n3e Mar 8, 2026
4459ef1
Rename ingest_hf_submission to sync_submission and reuse backend inst…
Am1n3e Mar 8, 2026
c54e999
Make backend a required parameter in sync_submission
Am1n3e Mar 8, 2026
a85d1e2
Add docstring to _locate_submission_directory
Am1n3e Mar 8, 2026
427b7b5
Construct leaderboard artifacts via Pydantic models instead of raw dicts
Am1n3e Mar 8, 2026
088f496
Consolidate score computation into core TasksEvalResults
Am1n3e Mar 8, 2026
111 changes: 111 additions & 0 deletions .github/workflows/deploy-site.yml
@@ -0,0 +1,111 @@
name: Deploy Site

on:
  push:
    branches:
      - main
    paths:
      - "docs/**"
      - "mkdocs.yml"
      - "leaderboard/site/**"
      - ".github/workflows/deploy-site.yml"
  workflow_dispatch:
    inputs:
      version:
        description: "Docs version to deploy (e.g., 1.0.0 or dev)"
        required: false
        default: "dev"
      manifest_url_override:
        description: "Optional leaderboard manifest URL override"
        required: false
        default: ""

permissions:
  contents: write

concurrency:
  group: "pages"
  cancel-in-progress: false

jobs:
  deploy:
    name: Deploy docs and leaderboard
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Setup environment
        uses: ./.github/actions/setup
        with:
          sync-args: --group dev
          configure-git-identity: "true"

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: npm
          cache-dependency-path: leaderboard/site/package-lock.json

      - name: Install leaderboard site dependencies
        run: npm ci
        working-directory: leaderboard/site

      - name: Fetch gh-pages branch
        run: |
          git fetch origin gh-pages --depth=1 || echo "gh-pages branch does not exist yet"

      # Step 1: mike deploy (no --push) updates local gh-pages ref
      - name: Deploy docs to local gh-pages
        run: uv run mike deploy ${{ inputs.version || 'dev' }}

      # Step 2: Build leaderboard Astro app
      - name: Resolve manifest URL
        id: cfg
        env:
          MANIFEST_URL_OVERRIDE: ${{ inputs.manifest_url_override || '' }}
          CONFIGURED_MANIFEST_URL: ${{ vars.PUBLIC_LEADERBOARD_MANIFEST_URL }}
        run: |
          set -euo pipefail
          MANIFEST_URL="$(uv run inv dev.leaderboard.site-resolve-manifest-url \
            --manifest-url-override "$MANIFEST_URL_OVERRIDE" \
            --configured-manifest-url "$CONFIGURED_MANIFEST_URL")"
          echo "manifest_url=$MANIFEST_URL" >> "$GITHUB_OUTPUT"

      - name: Build leaderboard site
        env:
          PUBLIC_LEADERBOARD_MANIFEST_URL: ${{ steps.cfg.outputs.manifest_url }}
        run: npm run build
        working-directory: leaderboard/site

      # Step 3: Inject leaderboard dist into gh-pages and push
      - name: Inject leaderboard and push gh-pages
        run: |
          set -euo pipefail

          # Checkout gh-pages into a worktree
          git worktree add _gh-pages gh-pages

          # Remove old leaderboard dir (if any) and copy fresh build
          rm -rf _gh-pages/leaderboard
          cp -r leaderboard/site/dist _gh-pages/leaderboard

          # Commit and push
          cd _gh-pages
          git add -A
          if git diff --cached --quiet; then
            echo "No changes to deploy" >> "$GITHUB_STEP_SUMMARY"
          else
            git commit -m "Deploy docs and leaderboard"
            git push origin gh-pages
            echo "Deployed docs and leaderboard to gh-pages" >> "$GITHUB_STEP_SUMMARY"
          fi

      - name: Deploy summary
        run: |
          echo "docs_version=${{ inputs.version || 'dev' }}" >> "$GITHUB_STEP_SUMMARY"
          echo "manifest_url=${{ steps.cfg.outputs.manifest_url }}" >> "$GITHUB_STEP_SUMMARY"
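
The `Resolve manifest URL` step above delegates to `inv dev.leaderboard.site-resolve-manifest-url`, whose implementation is not part of this diff. The following is only a sketch of the precedence that its two flags suggest. It assumes a non-empty override wins over the configured repository variable and that an empty result should fail the build. The function name `resolve_manifest_url` and the error behavior are hypothetical.

```python
def resolve_manifest_url(manifest_url_override: str, configured_manifest_url: str) -> str:
    """Hypothetical helper: choose the manifest URL for a site build.

    Assumed precedence (not confirmed by the diff): an explicit override
    first, then the configured value. Blank strings count as "not set".
    """
    for candidate in (manifest_url_override, configured_manifest_url):
        url = (candidate or "").strip()
        if url:
            return url
    # Assumption: failing loudly is preferable to building with no data source.
    raise ValueError("no manifest URL provided via override or configuration")
```

Under these assumptions, a `workflow_dispatch` run with an empty `manifest_url_override` would fall through to `vars.PUBLIC_LEADERBOARD_MANIFEST_URL`, which matches how the workflow passes both values to the task.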
48 changes: 0 additions & 48 deletions .github/workflows/dev-docs-publish.yml

This file was deleted.

51 changes: 51 additions & 0 deletions .github/workflows/leaderboard-hf-ingest.yml
@@ -0,0 +1,51 @@
name: Leaderboard HF Ingest

on:
  schedule:
    - cron: "*/30 * * * *"

permissions:
  contents: write

jobs:
  sync_submissions:
    name: Sync submissions from HF schedule trigger
    runs-on: ubuntu-latest
    concurrency:
      group: leaderboard-data-writer
      cancel-in-progress: false
    steps:
      - name: Checkout leaderboard-submissions
        uses: actions/checkout@v4
        with:
          ref: leaderboard-submissions
          fetch-depth: 0

      - name: Setup environment
        uses: ./.github/actions/setup
        with:
          sync-args: --group dev
          configure-git-identity: "true"

      - name: Sync pending submissions from HF
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
          WEBARENA_VERIFIED_LEADERBOARD_SUBMISSION_HF_REPO: ${{ vars.WEBARENA_VERIFIED_LEADERBOARD_SUBMISSION_HF_REPO || vars.LEADERBOARD_HF_REPO || 'AmineHA/WebArena-Verified-Submissions-dev' }}
        run: |
          set -euo pipefail
          uv run inv dev.leaderboard.hf-sync-submissions \
            --repo-root "." \
            --hf-repo "$WEBARENA_VERIFIED_LEADERBOARD_SUBMISSION_HF_REPO" \
            --hf-token "$HF_TOKEN"

      - name: Commit and push leaderboard updates
        run: |
          set -euo pipefail
          git add -A leaderboard
          if git diff --cached --quiet; then
            echo "No submission sync changes" >> "$GITHUB_STEP_SUMMARY"
            exit 0
          fi

          git commit -m "Sync HF submissions and rebuild leaderboard"
          git push origin HEAD:leaderboard-submissions
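
The ingest workflow's schedule, `*/30 * * * *`, targets minute 0 and minute 30 of every hour, and GitHub evaluates scheduled workflows in UTC. As a rough illustration of when a newly uploaded submission would next be picked up, here is a small sketch that computes the next nominal slot. The helper name `next_half_hour_slot` is hypothetical, and actual start times can lag the nominal slot because scheduled runs may be delayed.

```python
from datetime import datetime, timedelta, timezone


def next_half_hour_slot(now: datetime) -> datetime:
    """Return the first :00 or :30 boundary (UTC) strictly after `now`.

    `now` must be timezone-aware. This models only the nominal cron slot,
    not queueing delays on the runner side.
    """
    now = now.astimezone(timezone.utc)
    # Snap down to the current half-hour boundary, then step forward one slot.
    base = now.replace(minute=0 if now.minute < 30 else 30, second=0, microsecond=0)
    return base + timedelta(minutes=30)
```

For example, a submission uploaded at 10:15 UTC would be eligible for the nominal 10:30 UTC run, and one uploaded at exactly 10:30 UTC would wait for the 11:00 UTC slot under this model.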
67 changes: 67 additions & 0 deletions .github/workflows/leaderboard-site-build.yml
@@ -0,0 +1,67 @@
name: Leaderboard Site Build

on:
  pull_request:
    paths:
      - "leaderboard/site/**"
      - ".github/workflows/leaderboard-site-build.yml"
  workflow_dispatch:
    inputs:
      manifest_url_override:
        description: "Optional manifest URL override for this run"
        required: false
        default: ""

permissions:
  contents: read

concurrency:
  group: leaderboard-site-build-${{ github.ref }}
  cancel-in-progress: true

jobs:
  build:
    name: Build leaderboard site with production manifest source
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: npm
          cache-dependency-path: leaderboard/site/package-lock.json

      - name: Install site dependencies
        run: npm ci
        working-directory: leaderboard/site

      - name: Resolve manifest URL
        id: cfg
        env:
          MANIFEST_URL_OVERRIDE: ${{ github.event.inputs.manifest_url_override }}
          CONFIGURED_MANIFEST_URL: ${{ vars.PUBLIC_LEADERBOARD_MANIFEST_URL }}
        run: |
          set -euo pipefail
          MANIFEST_URL="$(uv run inv dev.leaderboard.site-resolve-manifest-url \
            --manifest-url-override "$MANIFEST_URL_OVERRIDE" \
            --configured-manifest-url "$CONFIGURED_MANIFEST_URL")"
          echo "$MANIFEST_URL"
          echo "manifest_url=$MANIFEST_URL" >> "$GITHUB_OUTPUT"

      - name: Run site tests
        run: npm test
        working-directory: leaderboard/site

      - name: Build site
        env:
          PUBLIC_LEADERBOARD_MANIFEST_URL: ${{ steps.cfg.outputs.manifest_url }}
        run: npm run build
        working-directory: leaderboard/site

      - name: Build summary
        run: |
          echo "manifest_url=${{ steps.cfg.outputs.manifest_url }}" >> "$GITHUB_STEP_SUMMARY"
2 changes: 2 additions & 0 deletions .gitignore
@@ -225,6 +225,8 @@ __marimo__/
output/
scratch/
*.wip.json
.sisyphus/
.wip/
*.json.backup
*.json.bkp
*.tmp.json
33 changes: 33 additions & 0 deletions docs/leaderboard/index.md
@@ -0,0 +1,33 @@
# Leaderboard

The WebArena-Verified leaderboard is the public results table for benchmark submissions.
Each entry is produced from deterministic offline evaluation and linked to a submission record.

- Live leaderboard: https://servicenow.github.io/webarena-verified/leaderboard/
- Boards:
    - `WebArena-Verified` (full benchmark, 812 tasks)
    - `WebArena-Verified-Hard` (hard subset, 258 tasks)

## What You Can Inspect

Each row includes:

- Rank
- Name
- Overall Score
- Per-site scores: Shopping, Shopping Admin, Reddit, GitLab, Wikipedia, Map
- Submission ID
- Evaluator Version

## UI Features

- Search and filter by submission name or submission ID
- Toggle between Full and Hard boards
- Export leaderboard rows as CSV
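
As a rough illustration of the ranking, search, and CSV export features listed above, here is a minimal Python sketch. It is not the site's implementation, which lives in the Astro app and is not shown here. The `Row` fields, the tie-break on submission ID, and the CSV column names are all assumptions made for the example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """Hypothetical, reduced leaderboard row (per-site scores omitted)."""
    name: str
    submission_id: int
    overall_score: float


def rank_rows(rows):
    """Order rows by overall score, highest first, and attach 1-based ranks.

    Assumption: ties are broken by the lower submission ID.
    """
    ordered = sorted(rows, key=lambda r: (-r.overall_score, r.submission_id))
    return list(enumerate(ordered, start=1))


def filter_rows(ranked, query):
    """Keep rows whose name or submission ID contains the query (case-insensitive)."""
    q = query.strip().lower()
    return [(rank, r) for rank, r in ranked
            if q in r.name.lower() or q in str(r.submission_id)]


def to_csv(ranked):
    """Serialize ranked rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["rank", "name", "overall_score", "submission_id"])
    for rank, r in ranked:
        writer.writerow([rank, r.name, r.overall_score, r.submission_id])
    return buf.getvalue()
```

Filtering after ranking keeps each row's rank stable while searching, so a filtered view still shows the position a submission holds on the full board.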

!!! info
    Evaluation is deterministic and offline. WebArena-Verified does not use LLM-as-judge scoring for leaderboard entries.

## Next Step

To publish results, follow the submission walkthrough: [Submitting Results](submission.md).