BaseMetrics:_compute_majority_at_k and MathMetrics:_compute_reward_at_k handling of duplicate answer strings could potentially be incorrect for non-deterministic judges #1257

@sgunasekar

Description

Two related issues arise when the same answer string receives different judge_correct scores across generations (possible when the LLM judge is non-deterministic):

_compute_majority_at_k (base.py:267-288)

The Counter for the majority answer is built over (answer, score) tuples instead of answer strings, so ("42", True) and ("42", False) are counted as distinct answers rather than as two votes for "42". This splits the vote for a single answer and can change which answer wins.

Example: K=5, predicted_answers = ["42", "42", "42", "43", "43"], judge scores = [True, False, True, False, False]:

  • Current: Counter is {("42", True): 2, ("42", False): 1, ("43", False): 2}; the two entries tied at 2 votes, ("42", True) and ("43", False), give majority_score = (1 + 0) / 2 = 0.5
  • Expected: "42" wins with 3 votes → majority_score = 1 (max or majority of judgements) or 0.67 (mean of judgements)
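The example above can be reproduced with a few lines of standalone Python. This is only a sketch of the counting behavior described in this issue, not the code in base.py; the variable names are illustrative, and averaging over tied top entries is an assumption inferred from the (1 + 0) / 2 arithmetic.

```python
from collections import Counter

predicted_answers = ["42", "42", "42", "43", "43"]
judge_scores = [True, False, True, False, False]

# Counting (answer, score) tuples, as this issue describes the current behavior.
tuple_counts = Counter(zip(predicted_answers, judge_scores))
top_count = max(tuple_counts.values())
tied = [key for key, count in tuple_counts.items() if count == top_count]
# Assumption: tied top entries are averaged, matching the 0.5 in the example.
current_score = sum(score for _, score in tied) / len(tied)

# Counting answer strings only, which is what majority voting intends.
answer_counts = Counter(predicted_answers)
majority_answer, votes = answer_counts.most_common(1)[0]

# Collect every judgement attached to the majority answer, then resolve them.
judgements = [s for a, s in zip(predicted_answers, judge_scores) if a == majority_answer]
mean_score = sum(judgements) / len(judgements)
max_score = float(max(judgements))

print(current_score, majority_answer, votes, round(mean_score, 2), max_score)
```

With these inputs the tuple-based count splits "42" into two entries, while the string-based count gives "42" all three of its votes.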

_compute_reward_at_k (math_metrics.py:54-64)

Correctness is stored by answer string with plain assignment (math_metrics.py:58): answer_to_correctness_dict[predicted_answer] = is_correct
Later occurrences therefore silently overwrite earlier ones, so the score of the reward-winning answer depends on which generation happened to come last in the list rather than on a systematic rule.
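A minimal sketch of the order dependence, assuming only the plain-assignment pattern quoted above. The helper name `last_write_correctness` is hypothetical and does not exist in math_metrics.py.

```python
def last_write_correctness(predicted_answers, judge_scores):
    """Mimic plain assignment: the last judgement seen for an answer wins."""
    answer_to_correctness = {}
    for predicted_answer, is_correct in zip(predicted_answers, judge_scores):
        answer_to_correctness[predicted_answer] = is_correct
    return answer_to_correctness


answers = ["42", "42"]

# Same answer, same pair of judgements, different generation order.
first_order = last_write_correctness(answers, [True, False])
second_order = last_write_correctness(answers, [False, True])

print(first_order["42"], second_order["42"])
```

The two calls see the same set of judgements for "42" but store opposite values, which is the non-systematic behavior described here.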

Suggested fix

  • Group by answer string first, then resolve correctness across all occurrences of that string (e.g. max, min, majority, or mean of judgements).
  • Note: neither issue affects symbolic_correct or any other deterministic score, since for those the same answer string always produces the same score.
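One possible shape for the suggested fix, as a hedged sketch rather than a patch against the repository. The function names (`group_judgements`, `resolve`, `majority_score`) and the `how` parameter are hypothetical. Two choices here are assumptions: vote ties between answers are broken by first appearance, and a split judgement under "majority" resolves to 0.

```python
from collections import defaultdict


def group_judgements(predicted_answers, judge_scores):
    """Group every judge score under its answer string."""
    grouped = defaultdict(list)
    for answer, score in zip(predicted_answers, judge_scores):
        grouped[answer].append(bool(score))
    return dict(grouped)


def resolve(judgements, how="mean"):
    """Collapse all judgements for one answer string into a single score."""
    if how == "max":
        return float(any(judgements))
    if how == "min":
        return float(all(judgements))
    if how == "mean":
        return sum(judgements) / len(judgements)
    if how == "majority":
        # Strict majority; an even split counts as incorrect (assumption).
        return float(2 * sum(judgements) > len(judgements))
    raise ValueError(f"unknown resolution strategy: {how}")


def majority_score(predicted_answers, judge_scores, how="mean"):
    """Vote on answer strings first, then resolve the winner's judgements."""
    grouped = group_judgements(predicted_answers, judge_scores)
    # max() keeps the first answer seen when vote counts tie (assumption).
    winner = max(grouped, key=lambda answer: len(grouped[answer]))
    return resolve(grouped[winner], how)


answers = ["42", "42", "42", "43", "43"]
scores = [True, False, True, False, False]
print(round(majority_score(answers, scores, "mean"), 2))
print(majority_score(answers, scores, "max"))
```

The same `group_judgements` and `resolve` pair could back the reward-based selection as well, so that the chosen answer's correctness no longer depends on generation order.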
