Two related issues arise when the same answer string receives different judge_correct scores across generations (possible when the LLM judge is non-deterministic):
_compute_majority_at_k (base.py:267-288)
The Counter used to pick the majority answer is built over (answer, score) tuples instead of answer strings, so ("42", True) and ("42", False) are counted as distinct answers rather than as two votes for "42". This splits the vote for a single answer and can change which answer is treated as the majority.
Example: K=5, predicted_answers = ["42", "42", "42", "43", "43"], judge scores = [True, False, True, False, False]:
- Current: Counter: {("42", True): 2, ("42", False): 1, ("43", False): 2} → majority_score = (1 + 0) / 2 = 0.5
- Expected: "42" wins with 3 votes → majority_score = 1 (max or majority of judgements) or 0.67 (mean of judgements)
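The example above can be reproduced with a minimal standalone sketch. It does not use the code in base.py; the variable names are illustrative, and averaging the scores of tied top entries is an assumption inferred from the (1 + 0) / 2 calculation above.

```python
from collections import Counter

predicted_answers = ["42", "42", "42", "43", "43"]
judge_scores = [True, False, True, False, False]

# Counting (answer, score) tuples splits the three votes for "42" in two.
tuple_counts = Counter(zip(predicted_answers, judge_scores))

# Assumed tie handling: average the scores of all entries sharing the top count.
top_count = max(tuple_counts.values())
tied_scores = [score for (_, score), n in tuple_counts.items() if n == top_count]
current_majority_score = sum(tied_scores) / len(tied_scores)

# Counting answer strings alone keeps all votes for "42" together.
answer_counts = Counter(predicted_answers)
majority_answer, votes = answer_counts.most_common(1)[0]

print(dict(tuple_counts))
print(current_majority_score)
print(majority_answer, votes)
```

With string-only counting, "42" is the unambiguous majority, and the remaining question is only how to combine its three judgements.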
_compute_reward_at_k (math_metrics.py:54-64)
Correctness is stored by answer string with plain assignment (math_metrics.py:58): answer_to_correctness_dict[predicted_answer] = is_correct
Later occurrences therefore silently overwrite earlier ones, so the score of the reward-winning answer depends on which generation happens to be last in the list rather than on a systematic rule.
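A small sketch of the overwrite behavior, independent of the code in math_metrics.py: the helper name last_write_wins is hypothetical and only mirrors the plain-assignment pattern described above.

```python
def last_write_wins(predicted_answers, judge_scores):
    """Mimic plain dict assignment: the last score seen for an answer is kept."""
    answer_to_correctness = {}
    for answer, is_correct in zip(predicted_answers, judge_scores):
        answer_to_correctness[answer] = is_correct  # overwrites earlier scores
    return answer_to_correctness


# The same three judgements for "42", in two different orders.
order_a = last_write_wins(["42", "42", "42"], [True, False, True])
order_b = last_write_wins(["42", "42", "42"], [True, True, False])

print(order_a)  # the last judgement was True
print(order_b)  # the last judgement was False
```

The set of judgements is identical in both calls, yet the stored correctness differs purely because of generation order.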
Suggested fix
- Group by answer string first, then resolve correctness across all occurrences of that string (e.g. max, min, majority, or mean of judgements).
- Note: neither issue affects symbolic_correct or any other deterministic scores since the same answer string always produces the same score.
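One possible shape for the fix, as a sketch under stated assumptions rather than a patch: the function name resolve_correctness and the how parameter are hypothetical, and treating a tied majority as incorrect is an arbitrary choice that the real fix would need to decide.

```python
from collections import defaultdict


def resolve_correctness(predicted_answers, judge_scores, how="majority"):
    """Group judgements by answer string, then reduce each group to one score."""
    grouped = defaultdict(list)
    for answer, is_correct in zip(predicted_answers, judge_scores):
        grouped[answer].append(bool(is_correct))

    resolvers = {
        "max": lambda votes: float(any(votes)),
        "min": lambda votes: float(all(votes)),
        # Strict majority; a tie resolves to 0.0 (assumption).
        "majority": lambda votes: float(2 * sum(votes) > len(votes)),
        "mean": lambda votes: sum(votes) / len(votes),
    }
    resolve = resolvers[how]
    return {answer: resolve(votes) for answer, votes in grouped.items()}


answers = ["42", "42", "42", "43", "43"]
scores = [True, False, True, False, False]

for how in ("max", "min", "majority", "mean"):
    print(how, resolve_correctness(answers, scores, how=how))
```

Both _compute_majority_at_k and _compute_reward_at_k could then look up the winning answer string in the resolved mapping, so the result no longer depends on tuple identity or on generation order.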