[Question] How to identify task success/failure in the released leaderboard trajectories?

Hi WebArena team,

I'm looking into the trajectory data released on the WebArena leaderboard (e.g., for agents such as IBM CUGA and OpenAI Operator). I would like to compare successful and failed trajectories to better understand agent behavior.

However, looking at the raw trajectory JSON files (e.g., the structure in `test.raw.json` or the agent output files), I cannot tell which field, if any, holds the final evaluation result (Success/Fail or Score 0/1).

**Questions:**
1.  Do the standard trajectory files released on the leaderboard include the evaluation result? If so, which specific key should I look for?
2.  If the trajectory files themselves do not contain the score, is there a separate manifest or summary file for each submission that maps `task_id` to its final score?
3.  Since ~50% of tasks require `program_html` (live environment) evaluation, is it correct to assume that we cannot re-evaluate these trajectories offline using only the JSON logs?
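
For context on question 3, this is roughly how I am separating the tasks that need the live environment from the rest. It is a minimal sketch, assuming each entry in `test.raw.json` has a `task_id` and an `eval.eval_types` list (my reading of the config format, not something I have confirmed with you); the function names are my own.

```python
import json
from collections import Counter


def load_tasks(path):
    """Load the task list from a config file such as test.raw.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def eval_types_of(task):
    # Assumed schema: task["eval"]["eval_types"] is a list of strings.
    return task.get("eval", {}).get("eval_types", [])


def count_eval_types(tasks):
    """Count how many tasks list each evaluator type."""
    counts = Counter()
    for task in tasks:
        counts.update(eval_types_of(task))
    return counts


def live_env_task_ids(tasks):
    """Return the task_ids whose evaluation includes program_html."""
    return [t["task_id"] for t in tasks if "program_html" in eval_types_of(t)]
```

Calling `live_env_task_ids(load_tasks("test.raw.json"))` (path adjusted to wherever the file lives) gives the set of tasks I assume cannot be re-scored from the JSON logs alone.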

Any guidance on how to map the released trajectories to their pass/fail status would be appreciated.

Thank you!

[Question] How to identify task success/failure in the released leaderboard trajectories? #244

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Question] How to identify task success/failure in the released leaderboard trajectories? #244

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions