@@ -71,17 +71,17 @@ def _add_std_metrics(self, metrics_dict):
         Only adds columns to pass@1[avg-of-{k}] that must exist in the passed metrics_dict.

         Example (max_k=4):
-        3 samples × 4 attempts = [[1,0,1,0], [1,1,0,0], [0,1,1,1]]
+        4 attempts x 3 samples = [[1,1,0], [0,1,1], [1,0,1], [0,0,1]]

-        Standard deviation and error of average metric values across runs (transpose):
+        Standard deviation and error of average metric values across runs:
         - Run 1: [1,1,0] → avg 0.6667
         - Run 2: [0,1,1] → avg 0.6667
         - Run 3: [1,0,1] → avg 0.6667
         - Run 4: [0,0,1] → avg 0.3333
         → std_dev_across_runs = stdev([0.6667, 0.6667, 0.6667, 0.3333]) ≈ 0.1667
         → std_err_across_runs = 0.1667 / sqrt(4) ≈ 0.0833

-        Average of per-sample standard deviations:
+        Average of per-sample standard deviations (transpose):
         - Sample 1: stdev([1,0,1,0]) ≈ 0.5773
         - Sample 2: stdev([1,1,0,0]) ≈ 0.5773
         - Sample 3: stdev([0,1,1,1]) ≈ 0.5000
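
The docstring example can be checked with a minimal sketch like the one below. It is not the repository's implementation: `std_metrics` and `scores` are hypothetical names, and it assumes the docstring's `stdev` means the sample standard deviation (n-1 denominator), as computed by Python's `statistics.stdev`.

```python
import math
import statistics

# Rows are runs (attempts), columns are samples, matching the
# "4 attempts x 3 samples" layout in the docstring example.
scores = [[1, 1, 0], [0, 1, 1], [1, 0, 1], [0, 0, 1]]


def std_metrics(scores):
    """Hypothetical helper: recompute the three quantities from the example."""
    # Average each run over its samples, then take the spread of those averages.
    run_avgs = [sum(run) / len(run) for run in scores]
    std_dev_across_runs = statistics.stdev(run_avgs)
    std_err_across_runs = std_dev_across_runs / math.sqrt(len(scores))

    # Transpose so each row holds one sample's results across all runs.
    per_sample = list(zip(*scores))
    sample_std_devs = [statistics.stdev(sample) for sample in per_sample]
    avg_sample_std_dev = statistics.mean(sample_std_devs)

    return {
        "std_dev_across_runs": std_dev_across_runs,
        "std_err_across_runs": std_err_across_runs,
        "avg_sample_std_dev": avg_sample_std_dev,
    }


for name, value in std_metrics(scores).items():
    print(f"{name}: {value:.4f}")
```

Under that assumption, the run averages [0.6667, 0.6667, 0.6667, 0.3333] have a sample standard deviation of about 0.1667 and a standard error of about 0.0833. The per-sample standard deviations (about 0.5774, 0.5774, and 0.5000) average to about 0.5516.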