# Submitting Results

This guide shows the full submission flow from run outputs to leaderboard ingestion.

## Prerequisites

Your run output should contain task folders in this shape:

```
<run_output_dir>/
  <task_id>/
    agent_response.json
    network.har
```

You can submit partial coverage. Package creation reports leaderboard coverage counts for valid, incomplete, and missing tasks.
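Before packaging, you can precheck coverage yourself. The sketch below is illustrative only (the `classify_task` and `coverage` helpers are hypothetical, not part of the CLI); it mirrors the valid/incomplete/missing categories described later in this guide, assuming the folder shape above:

```python
from pathlib import Path

# The two files each task folder must contain (per the shape above).
REQUIRED = {"agent_response.json", "network.har"}

def classify_task(task_dir: Path) -> str:
    """Classify a task folder by how many required files it contains."""
    if not task_dir.is_dir():
        return "missing"
    present = {f.name for f in task_dir.iterdir()} & REQUIRED
    if present == REQUIRED:
        return "valid"
    if present:
        return "incomplete"
    return "missing"

def coverage(run_output_dir: Path, expected_task_ids: list[str]) -> dict:
    """Count valid/incomplete/missing tasks against an expected task list."""
    counts = {"valid": 0, "incomplete": 0, "missing": 0}
    for task_id in expected_task_ids:
        counts[classify_task(run_output_dir / task_id)] += 1
    counts["expected"] = len(expected_task_ids)
    return counts
```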

## Step 1 - Create Submission Package

```bash
uvx webarena-verified create-submission-pkg \
  --run-output-dir ./output \
  --output ./my-submission \
  --leaderboard both
```

If you prefer, you can run the same CLI through a local install, `uv run`, or Docker.

The `--output` path is the submission package directory itself. If it already exists, the command fails unless you pass `--force` to overwrite it.

Expected result: a `./my-submission/` folder containing task folders, `submission.json`, and `manifest.json`.

`create-submission-pkg` embeds coverage stats in `submission.json` under `packaged_tasks`:

- `valid`: tasks with both required files
- `incomplete`: tasks with exactly one required file
- `missing`: tasks with no files or no task directory
- `expected`: total expected tasks for the leaderboard scope
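By definition, the first three counts should sum to the expected total for each leaderboard scope. A minimal sanity-check sketch (the `check_packaged_tasks` helper is hypothetical; the `packaged_tasks` keys follow the example in Step 2):

```python
import json
from pathlib import Path

def check_packaged_tasks(submission_json: Path) -> None:
    """Assert valid + incomplete + missing == expected for each scope."""
    stats = json.loads(submission_json.read_text())["packaged_tasks"]
    for scope, c in stats.items():
        total = c["valid"] + c["incomplete"] + c["missing"]
        assert total == c["expected"], f"{scope}: {total} != {c['expected']}"
```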

## Step 2 - Edit submission.json

Open `submission.json` and replace the placeholder values for `name`, `model`, `reference`, and `contact_email`:

```json
{
  "name": "MySystem-v1",
  "model": "gpt-4.1-mini",
  "leaderboard": "both",
  "reference": "https://example.com/paper",
  "code_repository": "https://github.com/org/repo",
  "contact_email": "team@example.com",
  "packaged_tasks": {
    "full": {
      "valid": 750,
      "incomplete": 12,
      "missing": 50,
      "expected": 812
    },
    "hard": {
      "valid": 230,
      "incomplete": 5,
      "missing": 23,
      "expected": 258
    }
  }
}
```
| Field | Required | Description |
| --- | --- | --- |
| `name` | Yes | Submission name (e.g. `MySystem-v1`) |
| `model` | Yes | Model identifier used for this submission (e.g. `gpt-4.1-mini`) |
| `leaderboard` | Yes | Auto-filled from `create-submission-pkg --leaderboard`; do not change unless you recreate the package |
| `reference` | Yes | HTTP(S) URL to a paper or model reference |
| `code_repository` | No | HTTP(S) URL to the agent code repository |
| `contact_email` | Yes | Contact email used only for submission-maintenance communication |
| `packaged_tasks` | Yes | Auto-filled coverage summary; do not edit |
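Forgotten placeholders are easy to catch programmatically. A hedged sketch of a local precheck (the `missing_fields` helper is hypothetical and not part of the CLI, which performs its own validation on submit):

```python
import json
from pathlib import Path

# Required fields per the table above; code_repository is optional.
REQUIRED_FIELDS = ("name", "model", "leaderboard", "reference", "contact_email")

def missing_fields(submission_json: Path) -> list[str]:
    """Return required fields that are absent or empty in submission.json."""
    data = json.loads(submission_json.read_text())
    return [f for f in REQUIRED_FIELDS if not data.get(f)]
```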

!!! info "How contact_email is used"
    `contact_email` is used only to contact the submission author when a modification to the submission is required. Maintainers may use it to verify that modification requests come from the original author. If you prefer not to share a real address, you can use a dummy value.

## Step 3 - Submit To Leaderboard

```bash
uvx webarena-verified submit \
  --submission-dir ./my-submission
```

What this command does:

1. Validates the package and reads your `submission.json`.
2. Regenerates `manifest.json` for integrity.
3. Uploads the payload to HuggingFace and creates a dataset PR.
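Step 2 above can be sketched as a content-hash manifest. This is illustrative only; the CLI defines the real manifest format, and the flat `path → sha256` mapping here is an assumption:

```python
import hashlib
import json
from pathlib import Path

def build_manifest(pkg_dir: Path) -> dict[str, str]:
    """Map each file's package-relative path to its SHA-256 digest."""
    manifest = {}
    for path in sorted(pkg_dir.rglob("*")):
        # Skip the manifest itself so regeneration is idempotent.
        if path.is_file() and path.name != "manifest.json":
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(pkg_dir))] = digest
    return manifest

def write_manifest(pkg_dir: Path) -> None:
    """Regenerate manifest.json inside the package directory."""
    payload = json.dumps(build_manifest(pkg_dir), indent=2)
    (pkg_dir / "manifest.json").write_text(payload)
```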

Expected output includes the HuggingFace PR URL, for example:

```
PR URL: https://huggingface.co/datasets/<org>/<repo>/discussions/<N>
```

!!! info "Authentication"
    The `submit` command requires HuggingFace authentication. Use either method:

    ```bash
    # Option 1: Login via CLI (persistent)
    hf auth login

    # Option 2: Set token as environment variable
    export HF_TOKEN=hf_...
    ```

## What Happens Next

The automated ingestion pipeline runs every 30 minutes:

```mermaid
flowchart LR
    A[Submit CLI] --> B[HF dataset PR created]
    B --> C[Ingestion job validates payload]
    C --> D[Deterministic evaluation by task and site]
    D --> E[Canonical records updated]
    E --> F[Leaderboard artifacts published]
```

If your submission fails ingestion, review the PR payload and retry with a corrected package. For support, open an issue in the WebArena-Verified repository.