Hi WebArena team,
I'm looking into the trajectory data released on the WebArena leaderboard (e.g., for IBM CUGA, OpenAI Operator, and other agents). I would like to compare successful and failed trajectories to better understand agent behavior.
However, looking at the raw trajectory JSON files (e.g. the structure in test.raw.json or the agent output files), it is not immediately obvious which field indicates the final evaluation result (Success/Fail or Score 0/1).
Questions:
- Do the standard trajectory files released on the leaderboard include the evaluation result? If so, which specific key should I look for?
- If the trajectory files themselves do not contain the score, is there a separate manifest or summary file for each submission that maps task_id to its final score?
- Since ~50% of tasks require program_html (live environment) evaluation, is it correct to assume that we cannot re-evaluate these trajectories offline using only the JSON logs?
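For context on the last question, this is roughly how I would count how many tasks depend on program_html. It is only a sketch: it assumes each entry in test.raw.json carries an eval object with an eval_types list, which is my reading of the config and may not match the actual schema. The helper names are my own.

```python
import json
from collections import Counter


def load_tasks(path):
    """Load the task config list (assumed to be a JSON array of task dicts)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_eval_types(tasks):
    """Count how many tasks list each evaluator type.

    Assumption: each task looks like {"task_id": ..., "eval": {"eval_types": [...]}}.
    Tasks without that structure are simply skipped.
    """
    counts = Counter()
    for task in tasks:
        for eval_type in task.get("eval", {}).get("eval_types", []):
            counts[eval_type] += 1
    return counts


def needs_live_env(task):
    """True if the task lists program_html among its (assumed) eval types."""
    return "program_html" in task.get("eval", {}).get("eval_types", [])
```

If the schema is different, a pointer to the right field would already help me separate the tasks that can be checked offline from those that cannot.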
Any guidance on how to map the released trajectories to their pass/fail status would be appreciated.
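To make the request concrete, here is a minimal sketch of the join I would like to perform. It assumes a hypothetical per-submission mapping from task_id to a 0/1 score; I do not know whether such a file exists or what it would be called, so the input shapes and the function name are placeholders.

```python
def split_by_outcome(trajectories, scores):
    """Partition task ids into passed / failed / unknown.

    trajectories: dict mapping task_id -> trajectory object (any shape).
    scores: dict mapping task_id -> score, hypothetically 1 for success, 0 for failure.
    Task ids are compared as strings so that 12 and "12" match.
    """
    normalized = {str(task_id): score for task_id, score in scores.items()}
    passed, failed, unknown = [], [], []
    for task_id in trajectories:
        score = normalized.get(str(task_id))
        if score is None:
            unknown.append(task_id)   # no score found for this trajectory
        elif float(score) >= 1.0:
            passed.append(task_id)
        else:
            failed.append(task_id)
    return passed, failed, unknown
```

For example, split_by_outcome({0: ..., 1: ..., 2: ...}, {"0": 1, "1": 0}) would put task 0 in passed, task 1 in failed, and task 2 in unknown.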
Thank you!