BaseMetrics:_compute_majority_at_k and MathMetrics:_compute_reward_at_k may handle duplicate answer strings incorrectly for non-deterministic judges #1257

Two related issues arise when the same answer string appears with different judge_correct scores across generations (possible when the LLM judge is non-deterministic):

#### `_compute_majority_at_k` (base.py:267-288)

The Counter used to pick the majority answer is built over (answer, score) tuples instead of answer strings, so ("42", True) and ("42", False) are counted as distinct answers rather than as two votes for "42". This splits the vote for an answer and can change which entry wins.

Example: K=5, predicted_answers = ["42", "42", "42", "43", "43"], judge scores = [True, False, True, False, False]:
- Current: Counter: {("42", True): 2, ("42", False): 1, ("43", False): 2} → ("42", True) and ("43", False) tie at 2 votes → majority_score = (1 + 0) / 2 = 0.5
- Expected: "42" wins with 3 votes → majority_score = 1 (max or majority of judgements) or 0.67 (mean of judgements)
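The example above can be reproduced with a few lines of standalone Python. This is only a sketch of the counting behaviour, not the code in base.py: the tie handling (averaging the scores of all top-count entries) is an assumption inferred from the `(1 + 0) / 2` arithmetic in the example, and the variable names are illustrative.

```python
from collections import Counter

predicted_answers = ["42", "42", "42", "43", "43"]
judge_scores = [True, False, True, False, False]

# Counting (answer, score) tuples splits the three "42" votes into two buckets.
tuple_counts = Counter(zip(predicted_answers, judge_scores))
top_count = max(tuple_counts.values())

# Assumption: ties at the top count are resolved by averaging their scores.
tied_scores = [score for (_, score), n in tuple_counts.items() if n == top_count]
tuple_based_score = sum(tied_scores) / len(tied_scores)

# Counting answer strings alone keeps all three "42" votes together.
answer_counts = Counter(predicted_answers)
majority_answer, votes = answer_counts.most_common(1)[0]

print(tuple_based_score)       # score when votes are split by judge score
print(majority_answer, votes)  # winner when votes are counted per answer string
```

With tuple keys, "42" never reaches its true vote count of 3, which is why "43" can tie for first place.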

#### `_compute_reward_at_k` (math_metrics.py:54-64)
Correctness is stored per answer string by plain assignment (math_metrics.py:58): `answer_to_correctness_dict[predicted_answer] = is_correct`
Later occurrences therefore silently overwrite earlier ones, so the score of the reward-winning answer depends on which generation happens to be last in the list rather than on a systematic rule.
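The order dependence can be shown in isolation. The helper below is a hypothetical stand-in that only mirrors the assignment quoted from math_metrics.py:58; it is not the actual `_compute_reward_at_k` implementation, and the function name is made up for illustration.

```python
def last_write_wins(predicted_answers, judge_scores):
    """Hypothetical helper mirroring the plain-assignment pattern."""
    answer_to_correctness_dict = {}
    for predicted_answer, is_correct in zip(predicted_answers, judge_scores):
        # A later judgement for the same answer replaces the earlier one.
        answer_to_correctness_dict[predicted_answer] = is_correct
    return answer_to_correctness_dict


# Same answer, same pair of judgements, different generation order.
true_then_false = last_write_wins(["42", "42"], [True, False])
false_then_true = last_write_wins(["42", "42"], [False, True])

print(true_then_false["42"], false_then_true["42"])
```

The two calls see identical judgements for "42" but store opposite values, so the final score depends only on generation order.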

#### Suggested fix
- Group by answer string first, then resolve correctness across all occurrences of that string (e.g. max, min, majority, or mean of judgements).
- Note: neither issue affects symbolic_correct or any other deterministic scores since the same answer string always produces the same score.
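A minimal sketch of the grouping approach, assuming boolean judge scores. All names here (`group_judgements`, `RESOLVERS`, `majority_score`, the `how` parameter) are hypothetical and not part of the existing code. Two choices are assumptions rather than anything the issue specifies: ties between answers go to the answer seen first, and "majority" requires strictly more than half of the judgements to be True.

```python
from collections import defaultdict


def group_judgements(predicted_answers, judge_scores):
    """Collect every judge score seen for each answer string."""
    grouped = defaultdict(list)
    for answer, score in zip(predicted_answers, judge_scores):
        grouped[answer].append(bool(score))
    return dict(grouped)


# Ways to collapse several judgements of one answer into a single score.
RESOLVERS = {
    "max": lambda scores: float(max(scores)),
    "min": lambda scores: float(min(scores)),
    "mean": lambda scores: sum(scores) / len(scores),
    # Assumption: an even split does not count as a majority.
    "majority": lambda scores: float(2 * sum(scores) > len(scores)),
}


def majority_score(predicted_answers, judge_scores, how="mean"):
    """Pick the most frequent answer string, then resolve its judgements."""
    grouped = group_judgements(predicted_answers, judge_scores)
    # Votes are counted per answer string; max() keeps the first answer on ties.
    winner = max(grouped, key=lambda answer: len(grouped[answer]))
    return winner, RESOLVERS[how](grouped[winner])


answers = ["42", "42", "42", "43", "43"]
scores = [True, False, True, False, False]
for how in ("max", "min", "majority", "mean"):
    print(how, majority_score(answers, scores, how=how))
```

The same grouped dictionary could replace the plain assignment in the reward-based path, since it keeps every judgement for an answer instead of only the last one.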
