Conversation
📝 Walkthrough

ArenaMetrics was rewritten to store per-prompt (gen-base, base-gen) judgement pairs, select a best-per-direction label via a new private helper, and derive pass@N and pass@1[avg-of-N] from that stored data. Aggregation and numpy→native casting were added.

Changes: ArenaMetrics Redesign.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemo_skills/evaluation/metrics/arena_metrics.py`:
- Around line 159-165: evaluations_to_print currently compares self.max_k to 1
which raises TypeError when max_k is None; change the method to mirror
get_metrics by normalizing n = self.max_k or 1 and then use n in the logic and
in the formatted metric names (e.g., use n in place of self.max_k for the
pass@{n} and pass@1[avg-of-{n}] strings) so the method safely handles None and
matches get_metrics behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a1ade340-c011-4bea-87fb-1438315cd9c0
📒 Files selected for processing (2)

- nemo_skills/evaluation/metrics/arena_metrics.py
- tests/test_arena_metrics.py
```python
def evaluations_to_print(self):
    # Override BaseMetrics' default — Arena doesn't compute majority@k, so dropping
    # that key avoids a missing-key request to the framework's printer (matches the
    # OmniMetrics convention).
    if self.max_k > 1:
        return [f"pass@{self.max_k}", f"pass@1[avg-of-{self.max_k}]"]
    return ["pass@1"]
```
TypeError when max_k is None.

`self.max_k > 1` will raise `TypeError: '>' not supported between instances of 'NoneType' and 'int'` when `max_k` is unset. This contrasts with `get_metrics()`, which handles this with `n = self.max_k or 1`.
Proposed fix
```diff
 def evaluations_to_print(self):
     # Override BaseMetrics' default — Arena doesn't compute majority@k, so dropping
     # that key avoids a missing-key request to the framework's printer (matches the
     # OmniMetrics convention).
-    if self.max_k > 1:
+    if self.max_k and self.max_k > 1:
         return [f"pass@{self.max_k}", f"pass@1[avg-of-{self.max_k}]"]
     return ["pass@1"]
```
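The agent prompt for this comment describes a slightly different shape: normalize `max_k` once, the way `get_metrics` does, and reuse the result in the metric names. A minimal sketch of that variant, assuming `get_metrics` really uses `n = self.max_k or 1` as the comment states:

```python
def evaluations_to_print(self):
    # Mirror get_metrics: treat an unset max_k as a single generation so the
    # comparison below never compares None to an int.
    n = self.max_k or 1
    if n > 1:
        return [f"pass@{n}", f"pass@1[avg-of-{n}]"]
    return ["pass@1"]
```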
Force-pushed 9e1a154 to 00878f8, then 00878f8 to 2839e25.

Signed-off-by: Jakub Slowikowski <jslowikowski@nvidia.com>
🧹 Nitpick comments (1)
tests/test_arena_metrics.py (1)
Lines 165-185: ⚡ Quick win: Add repeat-mode per-category assertions to cover the new category path.

This test currently uses one category only, so it won't catch regressions in per-category emission for `num_repeats > 1` (which is part of the PR behavior).

Suggested test extension
```diff
-    for _ in range(n_prompts):
-        preds = [_make_prediction(*random.choice(scores_pool), category="test") for _ in range(n_repeats)]
+    for i in range(n_prompts):
+        category = "hard_prompt" if i % 2 == 0 else "creative_writing"
+        preds = [_make_prediction(*random.choice(scores_pool), category=category) for _ in range(n_repeats)]
         m.update(preds)
@@
     assert metrics[pass_at_n]["num_entries"] == n_prompts
     assert metrics[avg_of_n]["num_entries"] == n_prompts
+
+    for metric_name in (pass_at_n, avg_of_n):
+        assert "category_hard_prompt" in metrics[metric_name]
+        assert "category_creative_writing" in metrics[metric_name]
+        assert metrics[metric_name]["category_hard_prompt"]["num_entries"] == n_prompts // 2
+        assert metrics[metric_name]["category_creative_writing"]["num_entries"] == n_prompts // 2
```
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/test_arena_metrics.py`:
- Around line 165-185: Extend the test to emit predictions for at least two
distinct categories (e.g., keep category="test" and add category="alt") when
calling m.update, then call m.get_metrics() and assert that for each category
there are per-category metric keys built from pass_at_n and avg_of_n (e.g.,
f"{category}:{pass_at_n}" and f"{category}:{avg_of_n}"), that each per-category
metrics["..."]["num_entries"] == n_prompts, and that for each category
metrics["...pass@N"]["score"] >= metrics["...avg-of-N"]["score"]; use the
existing variables and functions (m.update, m.get_metrics, pass_at_n, avg_of_n,
and the category argument used in _make_prediction) to locate and implement
these assertions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 12bb5654-3119-4378-a0de-297ba30f9c1d
📒 Files selected for processing (2)

- nemo_skills/evaluation/metrics/arena_metrics.py
- tests/test_arena_metrics.py
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/evaluation/metrics/arena_metrics.py
Summary
`ArenaMetrics` currently emits only `pass@N` (best-of-N win rate) when `num_repeats > 1`. This PR adds `pass@1[avg-of-N]` (average single-shot win rate) so Arena-Hard runs surface both metrics, matching the convention of sister metric classes (`OmniMetrics`, `MathMetrics`) that emit both via `BaseMetrics._compute_pass_at_k`.

Resolves the TODO in `arena_metrics.py:124` ("the class should support pass@k").

Why
Arena-Hard scoring is a global Bradley-Terry / Elo win rate via `get_aggregate_score` rather than a per-prediction binary correctness score, so `ArenaMetrics` cannot reuse `BaseMetrics._compute_pass_at_k` directly. The existing `update()` picked the best-of-N pair per prompt eagerly and discarded the other N-1 generations, which made `pass@1[avg-of-N]` impossible to compute downstream.

How
- `update()` now stores all per-prediction `(judgement-gen-base, judgement-base-gen)` score pairs in `self.per_prompt_scores` (no eager selection, no `N==1` vs `N>1` branching).
- `get_metrics()` derives both metrics from the same stored data; the `N==1` case is degenerate with `pass@1` (see the sketch after this list).
- Aggregation goes through an `_aggregate` helper; numpy values are cast to native types so `yaml.safe_dump` works downstream.
- `evaluations_to_print` overridden to drop the inherited `majority@N` request that Arena never emitted (matches `OmniMetrics` convention).
- Removed `self.lengths` (never read by anything) and `self.agg_mode` (no longer needed).
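A rough sketch of that storage/derivation split, to make the flow concrete. Only `update()`, `get_metrics()`, `per_prompt_scores`, `_aggregate`, and the metric names come from the PR text; the pair-strength and aggregation bodies are hypothetical numeric stand-ins for the real Arena-Hard plumbing, and every prompt is assumed to have the same number of repeats:

```python
class ArenaMetricsFlowSketch:
    """Illustration of the storage/derivation split; not the real ArenaMetrics."""

    def __init__(self):
        # One entry per prompt: a list of N (gen-base, base-gen) judgement pairs.
        self.per_prompt_scores = []

    def update(self, predictions):
        # Store every pair; no eager best-of-N selection, no N==1 branching.
        self.per_prompt_scores.append(
            [(p["judgement-gen-base"], p["judgement-base-gen"]) for p in predictions]
        )

    @staticmethod
    def _pair_strength(pair):
        # Hypothetical stand-in: the real class picks a best-per-direction label
        # via a private helper; here we just sum the two directions.
        return sum(pair)

    @staticmethod
    def _aggregate(pairs):
        # Hypothetical stand-in for the Bradley-Terry / Elo aggregation
        # (get_aggregate_score in the real code); here, a plain mean.
        return sum(a + b for a, b in pairs) / (2 * len(pairs))

    def get_metrics(self):
        n = max(len(pairs) for pairs in self.per_prompt_scores)
        if n == 1:  # degenerate case: only pass@1
            return {"pass@1": self._aggregate([p[0] for p in self.per_prompt_scores])}
        # pass@N: best pair per prompt, aggregated once.
        best = [max(pairs, key=self._pair_strength) for pairs in self.per_prompt_scores]
        # pass@1[avg-of-N]: aggregate each repeat separately, then average.
        per_repeat = [
            self._aggregate([pairs[i] for pairs in self.per_prompt_scores]) for i in range(n)
        ]
        return {
            f"pass@{n}": self._aggregate(best),
            f"pass@1[avg-of-{n}]": sum(per_repeat) / n,
        }
```

With these numeric stand-ins the best-of-N figure can never fall below the per-repeat average, which is consistent with the `pass@N >= pass@1[avg-of-N]` assertion suggested for the tests above.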
Behavior preserved

- `pass@N` numerical output unchanged for any input.
- `N==1` case (single prediction per prompt) still emits exactly `pass@1` and nothing else.
- `get_incorrect_sample`, `_get_judge_score`, `__init__` all unchanged.

Test plan
- `pytest tests/test_arena_metrics.py` — 6/6 passing (5 existing tests + 1 new for `num_repeats > 1`).
- `yaml.safe_dump` round-trip on the returned metrics dict (verifies the numpy-to-native casts); see the sketch after this list.
- Re-scored `output-rs*.jsonl` from a 500-prompt × 5-repeat eval; produced `pass@5 = 97.82` (matching the unchanged best-of-N path) and `pass@1[avg-of-5] = 92.258` (matching an offline reference computation that uses the same `np.random.seed(42)` Elo bootstrap; per-repeat scores `[91.99, 91.95, 93.82, 91.48, 92.05]`).
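For context on that round-trip check: `yaml.safe_dump` refuses numpy scalars, which is exactly what the native casts guard against. A minimal standalone illustration (not the PR's code) that also double-checks the reported average:

```python
import numpy as np
import yaml

per_repeat = [91.99, 91.95, 93.82, 91.48, 92.05]
score = np.float64(sum(per_repeat) / len(per_repeat))  # 92.258, as reported above

# yaml.safe_dump only knows native types; numpy scalars raise RepresenterError.
try:
    yaml.safe_dump({"pass@1[avg-of-5]": score})
except yaml.representer.RepresenterError:
    print("numpy scalar rejected, as expected")

# Casting to a native float makes the dump (and a later safe_load) work.
print(yaml.safe_dump({"pass@1[avg-of-5]": float(score)}))
```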