Hello,
While running the benchmark evaluation, I encountered an unexpected result in the output: for a task where num_complete_trials is 1, the mean_success_rate is reported as 0.5.
Here is the relevant line from the evaluation results:
| Index | Task | num_complete_trials | mean_success_rate | mean_episode_length | total_runtime_s | num_fail_trials |
| --- | --- | --- | --- | --- | --- | --- |
| 114 | TurnOnWifiAndOpenApp | 1 | 0.5 | 13 | 107.6 | 0 |
My understanding is that if num_complete_trials is 1, the mean_success_rate should logically be either 0.0 (if the single trial failed) or 1.0 (if the single trial succeeded). A value of 0.5 for a single trial seems contradictory.
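To make the arithmetic concrete, here is a minimal sketch of my reasoning. It assumes each trial yields a strictly binary outcome (0 or 1) and that mean_success_rate is a plain average over the completed trials. The function name `possible_mean_success_rates` is hypothetical and is not taken from the benchmark's code.

```python
from fractions import Fraction


def possible_mean_success_rates(num_trials: int) -> list[Fraction]:
    """Return every value a mean of binary (0/1) outcomes can take.

    Assumption: each trial is scored as exactly 0 or 1, so the mean over
    num_trials trials must equal k / num_trials for some integer k.
    """
    if num_trials < 1:
        raise ValueError("num_trials must be at least 1")
    return [Fraction(k, num_trials) for k in range(num_trials + 1)]


reported = Fraction(1, 2)  # the mean_success_rate shown in the table (0.5)

# With one trial, only 0 and 1 are reachable under the binary assumption.
print(reported in possible_mean_success_rates(1))  # False

# With two trials, 0.5 becomes reachable (one success, one failure).
print(reported in possible_mean_success_rates(2))  # True
```

Under these assumptions, 0.5 would only be consistent with an even number of trials, or with a per-trial score that is not strictly binary (for example, partial credit). I have not verified which of these applies to the benchmark.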
Could you please clarify why this might be the case? Is there a specific interpretation or calculation method I'm missing, or could this be an anomaly in the reporting?
Thank you for your time and assistance.