[Question] How to identify task success/failure in the released leaderboard trajectories? #244

@ojipadeson

Hi WebArena team,

I'm looking into the trajectory data released on the WebArena leaderboard (e.g., for agents such as IBM CUGA and OpenAI Operator). I would like to compare the successful and failed trajectories to better understand agent behavior.

However, looking at the raw trajectory JSON files (e.g. the structure in test.raw.json or the agent output files), it is not immediately obvious which field indicates the final evaluation result (Success/Fail or Score 0/1).

Questions:

  1. Do the standard trajectory files released on the leaderboard include the evaluation result? If so, which specific key should I look for?
  2. If the trajectory files themselves do not contain the score, is there a separate manifest or summary file for each submission that maps task_id to its final score?
  3. Since ~50% of tasks require program_html (live environment) evaluation, is it correct to assume that we cannot re-evaluate these trajectories offline using only the JSON logs?

Any guidance on how to map the released trajectories to their pass/fail status would be appreciated.
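
To make the intended mapping concrete, here is a minimal sketch of the join described in questions 1 and 2. Everything about the data layout is an assumption: the `task_id` and `score` keys, the manifest shape (a dict from task ID to score), and the function name `label_trajectories` are hypothetical and are not confirmed by the released files.

```python
from typing import Any


def label_trajectories(
    trajectories: list[dict[str, Any]],
    manifest: dict[str, float],
) -> dict[str, list[Any]]:
    """Group task IDs into success / failure / unknown.

    Assumptions (hypothetical, not confirmed by the released data):
    - each trajectory is a dict that carries a "task_id" key;
    - the score is either an inline "score" key or an entry in a
      separate manifest mapping str(task_id) -> score;
    - a score of exactly 1 means success, anything else means failure.
    """
    labels: dict[str, list[Any]] = {"success": [], "failure": [], "unknown": []}
    for traj in trajectories:
        task_id = traj.get("task_id")
        # Prefer an inline score; fall back to the manifest lookup.
        score = traj.get("score", manifest.get(str(task_id)))
        if score is None:
            labels["unknown"].append(task_id)
        elif float(score) == 1.0:
            labels["success"].append(task_id)
        else:
            labels["failure"].append(task_id)
    return labels
```

Trajectories that end up under `unknown` are the ones for which neither source provides a score, which is exactly the case this issue asks about.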

Thank you!
