Summary
The Success metric in benchmark/metrics.py uses three OR'd conditions for approximate numeric equality:
is_approx_equal = (
    rea < 1e-6 or                           # Condition 1: relative error < 1e-6
    abs_diff < 1e-6 or                      # Condition 2: absolute diff < 1e-6
    (abs_diff < larger_magnitude * 1e-4)    # Condition 3: 0.01% of magnitude
)
Condition 2 (abs_diff < 1e-6) causes false positives when target values have very small absolute magnitudes (e.g., ~1e-13). In such cases, the absolute difference between any two values of that magnitude is many orders of magnitude below 1e-6, so the condition holds regardless of their relative error.
Affected Tasks
Three tasks in the Astronomy workload have target answers with magnitudes ~1e-13:
| Task | Target | Example Prediction | Relative Error | abs_diff < 1e-6? | Scored |
|---|---|---|---|---|---|
| astronomy-easy-3 | 7.95e-13 | 8.02e-13 | 0.9% | trivially true | 1 (false positive) |
| astronomy-hard-7 | 1.211e-13 | 3.161e-13 | 796% | trivially true | 1 (false positive) |
| astronomy-hard-11 | 4.638e-13 | 5.05e-13 | 8.9% | trivially true | 1 (false positive) |
For astronomy-hard-7, a prediction that is about 2.6 times the target value still receives success=1 because |3.161e-13 - 1.211e-13| = 1.95e-13 < 1e-6.
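The false positives can be reproduced with a minimal sketch of the check. The helper name `current_is_approx_equal` is hypothetical, and the definitions of `rea` (absolute difference divided by the target magnitude) and `larger_magnitude` (the larger of the two absolute values) are assumptions made for illustration; the code in benchmark/metrics.py may compute them differently.

```python
def current_is_approx_equal(pred: float, target: float) -> bool:
    """Hypothetical stand-in for the three OR'd conditions quoted above."""
    abs_diff = abs(pred - target)
    # Assumed definition: the larger of the two absolute values.
    larger_magnitude = max(abs(pred), abs(target))
    # Assumed definition of `rea`: relative error with respect to the target.
    rea = abs_diff / abs(target) if target != 0 else float("inf")
    return (
        rea < 1e-6                                # Condition 1
        or abs_diff < 1e-6                        # Condition 2
        or abs_diff < larger_magnitude * 1e-4     # Condition 3
    )


# (target, prediction) pairs taken from the table above.
cases = {
    "astronomy-easy-3": (7.95e-13, 8.02e-13),
    "astronomy-hard-7": (1.211e-13, 3.161e-13),
    "astronomy-hard-11": (4.638e-13, 5.05e-13),
}

results = {
    task: current_is_approx_equal(pred, target)
    for task, (target, pred) in cases.items()
}
for task, accepted in results.items():
    print(task, accepted)
```

Under these assumptions all three pairs are accepted, and in each case only Condition 2 is responsible: neither relative condition holds for any of them.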
Root Cause
The abs_diff < 1e-6 condition is appropriate for values of "normal" magnitude (e.g., comparing 1.0000001 to 1.0), but becomes meaningless for values whose magnitude is already far below 1e-6. At that scale, it effectively reduces to True for any pair of small numbers.
Suggested Fix
Remove Condition 2 and rely on Conditions 1 and 3, which are both relative and therefore scale-invariant:
is_approx_equal = (
    rea < 1e-6 or                           # Condition 1: relative error < 1e-6
    (abs_diff < larger_magnitude * 1e-4)    # Condition 3: 0.01% of magnitude
)
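A minimal sketch of how the relative-only check behaves on the affected tasks follows. As before, the function name and the definitions of `rea` and `larger_magnitude` are assumptions for illustration rather than the evaluator's actual code.

```python
def relative_is_approx_equal(pred: float, target: float) -> bool:
    """Hypothetical relative-only check (Conditions 1 and 3)."""
    abs_diff = abs(pred - target)
    larger_magnitude = max(abs(pred), abs(target))  # assumed definition
    if larger_magnitude == 0:
        return True  # both values are exactly zero
    # Assumed definition of `rea`: relative error with respect to the target.
    rea = abs_diff / abs(target) if target != 0 else float("inf")
    return rea < 1e-6 or abs_diff < larger_magnitude * 1e-4


# (target, prediction) pairs from the Affected Tasks table.
cases = {
    "astronomy-easy-3": (7.95e-13, 8.02e-13),
    "astronomy-hard-7": (1.211e-13, 3.161e-13),
    "astronomy-hard-11": (4.638e-13, 5.05e-13),
}

outcomes = {
    task: relative_is_approx_equal(pred, target)
    for task, (target, pred) in cases.items()
}
for task, accepted in outcomes.items():
    print(task, accepted)

# A prediction within 0.001% of a tiny target is still accepted.
print(relative_is_approx_equal(7.95e-13 * (1 + 1e-5), 7.95e-13))
```

With these definitions the three example predictions are rejected, while values that agree to within 0.01% are accepted at any scale. One consequence worth noting: when the target is exactly 0, every nonzero prediction is rejected, however small, which is the case an absolute tolerance would normally cover. Condition 3 is also close to Python's `math.isclose(pred, target, rel_tol=1e-4, abs_tol=0.0)`, which differs mainly in using `<=` rather than `<`.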
Alternatively, if an absolute tolerance is desired for near-zero values, tie it to the target magnitude:
is_approx_equal = (
    rea < 1e-6 or
    (abs_diff < max(abs(target_num) * 1e-4, 1e-15))    # scale-aware absolute tolerance
)
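The scale-aware variant can be sketched the same way. The function name, the `floor` parameter, and the definition of `rea` are hypothetical and introduced only for illustration; `target_num` follows the snippet above.

```python
def scale_aware_is_approx_equal(
    pred: float, target_num: float, floor: float = 1e-15
) -> bool:
    """Hypothetical check with a tolerance tied to the target magnitude."""
    abs_diff = abs(pred - target_num)
    # Assumed definition of `rea`: relative error with respect to the target.
    rea = abs_diff / abs(target_num) if target_num != 0 else float("inf")
    # The tolerance scales with the target but never drops below `floor`.
    return rea < 1e-6 or abs_diff < max(abs(target_num) * 1e-4, floor)


# The three affected tasks are rejected.
print(scale_aware_is_approx_equal(8.02e-13, 7.95e-13))    # astronomy-easy-3
print(scale_aware_is_approx_equal(3.161e-13, 1.211e-13))  # astronomy-hard-7
print(scale_aware_is_approx_equal(5.05e-13, 4.638e-13))   # astronomy-hard-11

# A zero target accepts predictions below the floor and rejects larger ones.
print(scale_aware_is_approx_equal(1e-16, 0.0))
print(scale_aware_is_approx_equal(1e-9, 0.0))
```

A design point to weigh: for targets near 1e-13, the scaled term is about 1e-17, so the 1e-15 floor dominates and the effective tolerance is looser than 0.01% (a difference of 5e-16 on a target of 7.95e-13 would pass). If that matters, the floor would need to sit below the smallest expected target magnitude times 1e-4.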
Relation to Issue #15
This is a sibling concern to #15 (Robust Percentage Handling), which also addresses tolerance edge cases in the Success metric. Both issues stem from the same approximate-equality code path.
Impact on Benchmark Scores
Since this bug is in the official evaluator, it affects all systems equally and does not bias comparisons between systems evaluated with the same evaluator. However, it inflates absolute scores by 1-3 percentage points for systems that produce numeric predictions for the affected tasks.