Success metric: abs_diff tolerance causes false positives for small-magnitude target values #33

@k00513010-blip

Summary

The Success metric in benchmark/metrics.py uses three OR'd conditions for approximate numeric equality:

is_approx_equal = (
    rea < 1e-6 or                        # Condition 1: relative error < 1e-6
    abs_diff < 1e-6 or                    # Condition 2: absolute diff < 1e-6
    (abs_diff < larger_magnitude * 1e-4)  # Condition 3: 0.01% of magnitude
)

Condition 2 (abs_diff < 1e-6) causes false positives when target values have very small absolute magnitudes (e.g., ~1e-13). In such cases, the absolute difference between any two small values is trivially less than 1e-6, regardless of their relative error.

Affected Tasks

Three tasks in the Astronomy workload have target answers with magnitudes ~1e-13:

| Task | Target | Example prediction | Relative error | abs_diff < 1e-6? | Scored |
|---|---|---|---|---|---|
| astronomy-easy-3 | 7.95e-13 | 8.02e-13 | 0.9% | trivially true | 1 (false positive) |
| astronomy-hard-7 | 1.211e-13 | 3.161e-13 | 161% | trivially true | 1 (false positive) |
| astronomy-hard-11 | 4.638e-13 | 5.05e-13 | 8.9% | trivially true | 1 (false positive) |

For astronomy-hard-7, a prediction that is ~2.6x off from the target still receives success=1 because |3.161e-13 - 1.211e-13| = 1.95e-13 < 1e-6.
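
The false positives can be reproduced with a minimal sketch of the check. The helper name `current_check` and the definitions of `rea`, `abs_diff`, and `larger_magnitude` are assumptions made for illustration (the issue only quotes the final boolean expression), so the actual code in benchmark/metrics.py may compute them differently.

```python
def current_check(target: float, pred: float) -> bool:
    """Sketch of the three OR'd conditions quoted in the Summary."""
    abs_diff = abs(pred - target)
    larger_magnitude = max(abs(pred), abs(target))
    # Assumed definition of `rea`: error relative to the target value.
    rea = abs_diff / abs(target) if target != 0 else float("inf")
    return (
        rea < 1e-6                               # Condition 1
        or abs_diff < 1e-6                       # Condition 2
        or abs_diff < larger_magnitude * 1e-4    # Condition 3
    )


# (task, target, example prediction) taken from the table above.
ROWS = [
    ("astronomy-easy-3", 7.95e-13, 8.02e-13),
    ("astronomy-hard-7", 1.211e-13, 3.161e-13),
    ("astronomy-hard-11", 4.638e-13, 5.05e-13),
]

for task, target, pred in ROWS:
    rel = abs(pred - target) / abs(target)
    print(f"{task}: relative error {rel:.1%}, success={current_check(target, pred)}")
```

Under these assumptions every row is accepted, and only Condition 2 can be responsible: each relative error is far above both 1e-6 and 1e-4.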

Root Cause

The abs_diff < 1e-6 condition is appropriate for values of "normal" magnitude (e.g., comparing 1.0000001 to 1.0), but becomes meaningless for values whose magnitude is already far below 1e-6. At that scale, it effectively reduces to True for any pair of small numbers.

Suggested Fix

Remove Condition 2 and rely on Conditions 1 and 3, which are both relative and therefore scale-invariant:

is_approx_equal = (
    rea < 1e-6 or                        # Condition 1: relative error < 1e-6
    (abs_diff < larger_magnitude * 1e-4)  # Condition 3: 0.01% of magnitude
)
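
As a rough check of this option, the sketch below applies only the two relative conditions to the example predictions from Affected Tasks. `relative_only_check` is a hypothetical helper, and the definitions of `rea` and `larger_magnitude` are assumed rather than taken from benchmark/metrics.py.

```python
def relative_only_check(target: float, pred: float) -> bool:
    """Conditions 1 and 3 only; Condition 2 is dropped."""
    abs_diff = abs(pred - target)
    larger_magnitude = max(abs(pred), abs(target))
    # Assumed definition of `rea`: error relative to the target value.
    rea = abs_diff / abs(target) if target != 0 else float("inf")
    return rea < 1e-6 or abs_diff < larger_magnitude * 1e-4


examples = [
    (7.95e-13, 8.02e-13),    # astronomy-easy-3
    (1.211e-13, 3.161e-13),  # astronomy-hard-7
    (4.638e-13, 5.05e-13),   # astronomy-hard-11
]

for target, pred in examples:
    print(target, pred, relative_only_check(target, pred))

# A prediction within 0.01% of a tiny target is still accepted.
print(relative_only_check(7.95e-13, 7.9501e-13))
```

One consequence worth weighing: with purely relative conditions and strict `<` comparisons, a target of exactly 0 cannot be matched by any nonzero prediction, however small, because `abs_diff` then equals `larger_magnitude`.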

Alternatively, if an absolute tolerance is desired for near-zero values, tie it to the target magnitude:

is_approx_equal = (
    rea < 1e-6 or
    (abs_diff < max(abs(target_num) * 1e-4, 1e-15))  # scale-aware absolute tolerance
)
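
A minimal sketch of this variant follows, with the tolerances exposed as parameters. The function name `scale_aware_check`, its parameters, and the definition of `rea` are assumptions for illustration. Python's standard `math.isclose` uses a similar rule, `abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)`, so it is included for comparison.

```python
import math


def scale_aware_check(target: float, pred: float,
                      rel_tol: float = 1e-4, abs_floor: float = 1e-15) -> bool:
    """Relative tolerance with a small absolute floor for near-zero targets."""
    abs_diff = abs(pred - target)
    # Assumed definition of `rea`: error relative to the target value.
    rea = abs_diff / abs(target) if target != 0 else float("inf")
    return rea < 1e-6 or abs_diff < max(abs(target) * rel_tol, abs_floor)


examples = [
    (7.95e-13, 8.02e-13),    # astronomy-easy-3
    (1.211e-13, 3.161e-13),  # astronomy-hard-7
    (4.638e-13, 5.05e-13),   # astronomy-hard-11
]

for target, pred in examples:
    ours = scale_aware_check(target, pred)
    stdlib = math.isclose(target, pred, rel_tol=1e-4, abs_tol=1e-15)
    print(target, pred, ours, stdlib)

# The floor keeps a zero target matchable by a tiny nonzero prediction.
print(scale_aware_check(0.0, 5e-16))
```

The floor is still an absolute tolerance, so the original problem would reappear for targets far below 1e-15: for example, a target of 1e-18 with a prediction of 5e-16 passes this check. Whether 1e-15 is small enough depends on the smallest target magnitudes in the benchmark.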

Relation to Issue #15

This is a sibling concern to #15 (Robust Percentage Handling), which also addresses tolerance edge cases in the Success metric. Both issues stem from the same approximate-equality code path.

Impact on Benchmark Scores

Since this bug is in the official evaluator, every system is scored under the same rule, so comparisons between systems evaluated with the same evaluator are not biased by differing scoring logic. However, it inflates absolute scores by 1-3 percentage points for systems that produce numeric predictions for the affected tasks.
