Success metric: abs_diff tolerance causes false positives for small-magnitude target values #33

## Summary

The `Success` metric in `benchmark/metrics.py` uses three OR'd conditions for approximate numeric equality:

```python
is_approx_equal = (
    rea < 1e-6 or                        # Condition 1: relative error < 1e-6
    abs_diff < 1e-6 or                    # Condition 2: absolute diff < 1e-6
    (abs_diff < larger_magnitude * 1e-4)  # Condition 3: 0.01% of magnitude
)
```

**Condition 2** (`abs_diff < 1e-6`) causes false positives when target values have very small absolute magnitudes (e.g., ~1e-13). In such cases, the absolute difference between *any* two small values is trivially less than 1e-6, regardless of their relative error.
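The snippet above shows only the final boolean expression, so the following sketch fills in the surrounding variables to make the behaviour reproducible. The definitions of `rea` (relative error against the target) and `larger_magnitude` (the larger of the two absolute values), as well as the function name `current_conditions`, are assumptions made for illustration and may differ from what `benchmark/metrics.py` actually computes.

```python
def current_conditions(target_num: float, pred_num: float) -> dict:
    """Evaluate each of the three OR'd conditions separately."""
    abs_diff = abs(pred_num - target_num)
    # Assumption: larger_magnitude is the larger absolute value of the pair.
    larger_magnitude = max(abs(target_num), abs(pred_num))
    # Assumption: rea is the relative error measured against the target.
    rea = abs_diff / abs(target_num) if target_num != 0 else float("inf")
    return {
        "cond1_relative_1e-6": rea < 1e-6,
        "cond2_absolute_1e-6": abs_diff < 1e-6,
        "cond3_relative_1e-4": abs_diff < larger_magnitude * 1e-4,
    }


fired = current_conditions(1.211e-13, 3.161e-13)
print(fired)
# {'cond1_relative_1e-6': False, 'cond2_absolute_1e-6': True, 'cond3_relative_1e-4': False}
```

Under these assumptions, only Condition 2 accepts the pair; both relative checks reject it.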

## Affected Tasks

Three tasks in the Astronomy workload have target answers with magnitudes ~1e-13:

| Task | Target | Example Prediction | Relative Error | `abs_diff < 1e-6`? | Scored |
|---|---|---|---|---|---|
| `astronomy-easy-3` | 7.95e-13 | 8.02e-13 | 0.9% | trivially true | 1 (false positive) |
| `astronomy-hard-7` | 1.211e-13 | 3.161e-13 | **161%** | trivially true | 1 (false positive) |
| `astronomy-hard-11` | 4.638e-13 | 5.05e-13 | 8.9% | trivially true | 1 (false positive) |

For `astronomy-hard-7`, a prediction that is **~2.6x off** from the target still receives `success=1` because `|3.161e-13 - 1.211e-13| = 1.95e-13 < 1e-6`.
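The arithmetic for the three rows can be checked directly. This is a small sketch that recomputes the absolute difference and the prediction-to-target ratio from the target and prediction values in the table; the variable names are illustrative only.

```python
# (task, target, example prediction), copied from the table above
rows = [
    ("astronomy-easy-3", 7.95e-13, 8.02e-13),
    ("astronomy-hard-7", 1.211e-13, 3.161e-13),
    ("astronomy-hard-11", 4.638e-13, 5.05e-13),
]

report = []
for task, target, pred in rows:
    abs_diff = abs(pred - target)
    # Record the absolute difference, the ratio, and the Condition 2 verdict.
    report.append((task, abs_diff, pred / target, abs_diff < 1e-6))

for task, abs_diff, ratio, passes in report:
    print(f"{task:18s} abs_diff={abs_diff:.3e} pred/target={ratio:.3f} passes={passes}")
```

Every row passes the absolute check because each difference is at least seven orders of magnitude below `1e-6`.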

## Root Cause

The `abs_diff < 1e-6` condition is appropriate for values of "normal" magnitude (e.g., comparing 1.0000001 to 1.0), but becomes meaningless for values whose magnitude is already far below 1e-6. At that scale, it effectively reduces to `True` for any pair of small numbers.
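To illustrate the scale dependence, the sketch below holds the relative error fixed at 100% (the prediction is always exactly twice the target) and varies only the magnitude. The helper `condition_2` is a hypothetical stand-in for the absolute check, not a function from the benchmark.

```python
ABS_TOL = 1e-6  # the absolute tolerance used by Condition 2


def condition_2(target_num: float, pred_num: float) -> bool:
    """Absolute-difference check in isolation."""
    return abs(pred_num - target_num) < ABS_TOL


# Every pair has a prediction of exactly twice the target (100% relative error).
magnitudes = [1.0, 1e-3, 1e-6, 1e-9, 1e-13]
verdicts = {m: condition_2(m, 2 * m) for m in magnitudes}
print(verdicts)
# {1.0: False, 0.001: False, 1e-06: False, 1e-09: True, 1e-13: True}
```

The same 100% error is rejected at magnitudes of `1e-6` and above, and accepted once the values drop below the tolerance itself.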

## Suggested Fix

Remove Condition 2 and rely on Conditions 1 and 3, which are both relative and therefore scale-invariant:

```python
is_approx_equal = (
    rea < 1e-6 or                        # Condition 1: relative error < 1e-6
    (abs_diff < larger_magnitude * 1e-4)  # Condition 3: 0.01% of magnitude
)
```
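As a rough check of this option, the sketch below wraps Conditions 1 and 3 in a function and applies it to the three example predictions from the table. The function name, the definitions of `rea` and `larger_magnitude`, and the exact-equality shortcut are assumptions for illustration, not the evaluator's actual code.

```python
def is_approx_equal_relative(target_num: float, pred_num: float) -> bool:
    """Conditions 1 and 3 only; both scale with the magnitude of the inputs."""
    if pred_num == target_num:
        return True  # assumption: exact matches (including 0 == 0) should pass
    abs_diff = abs(pred_num - target_num)
    # Assumption: larger_magnitude is the larger absolute value of the pair.
    larger_magnitude = max(abs(target_num), abs(pred_num))
    # Assumption: relative error is measured against the target.
    rea = abs_diff / abs(target_num) if target_num != 0 else float("inf")
    return rea < 1e-6 or abs_diff < larger_magnitude * 1e-4


examples = [(7.95e-13, 8.02e-13), (1.211e-13, 3.161e-13), (4.638e-13, 5.05e-13)]
table_results = [is_approx_equal_relative(t, p) for t, p in examples]
print(table_results)  # [False, False, False]

# A prediction within 0.005% of a small target is still accepted.
close_result = is_approx_equal_relative(7.95e-13, 7.95e-13 * (1 + 5e-5))
print(close_result)  # True
```

One consequence of a purely relative check is that a target of exactly zero can only be matched by a prediction of exactly zero, which is the case an absolute floor is meant to cover.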

Alternatively, if an absolute tolerance is desired for near-zero values, combine a tolerance that scales with the target magnitude with a much smaller absolute floor:

```python
is_approx_equal = (
    rea < 1e-6 or
    (abs_diff < max(abs(target_num) * 1e-4, 1e-15))  # scale-aware absolute tolerance
)
```
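The following sketch exercises this variant on the table's examples and on a zero target. The function name `is_approx_equal_floor` and the definition of `rea` are assumptions; the `1e-15` floor is taken from the snippet above.

```python
FLOOR = 1e-15  # absolute floor from the snippet above


def is_approx_equal_floor(target_num: float, pred_num: float) -> bool:
    """Relative check plus a tolerance of max(0.01% of target, FLOOR)."""
    abs_diff = abs(pred_num - target_num)
    # Assumption: relative error is measured against the target.
    rea = abs_diff / abs(target_num) if target_num != 0 else float("inf")
    return rea < 1e-6 or abs_diff < max(abs(target_num) * 1e-4, FLOOR)


examples = [(7.95e-13, 8.02e-13), (1.211e-13, 3.161e-13), (4.638e-13, 5.05e-13)]
floor_results = [is_approx_equal_floor(t, p) for t, p in examples]
print(floor_results)  # [False, False, False]

# A zero target now tolerates predictions smaller than the floor.
zero_result = is_approx_equal_floor(0.0, 5e-16)
print(zero_result)  # True

# Relative tolerance implied by the floor for a target of 7.95e-13.
implied_rel_tol = FLOOR / 7.95e-13
print(f"{implied_rel_tol:.2e}")  # 1.26e-03
```

The floor takes over whenever `abs(target_num) * 1e-4` falls below `1e-15`, that is, for targets smaller than about `1e-11`. For the ~1e-13 targets listed here, the effective tolerance is therefore roughly 0.1% to 1% rather than 0.01%, so the floor value may need tuning if stricter matching is wanted at that scale.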

## Relation to Issue #15

This is a sibling concern to #15 (Robust Percentage Handling), which also addresses tolerance edge cases in the `Success` metric. Both issues stem from the same approximate-equality code path.

## Impact on Benchmark Scores

Since this bug is in the official evaluator, it affects **all systems equally** and does not bias comparisons between systems evaluated with the same evaluator. However, it inflates absolute scores by 1-3 percentage points for systems that produce numeric predictions for the affected tasks.

