
Add pass@1[avg-of-N] support to ArenaMetrics #1429

Open
kslowik wants to merge 1 commit into main from jslowikowski/arena-metrics-pass-at-1-avg-of-n

Conversation


kslowik (Collaborator) commented May 6, 2026

Summary

ArenaMetrics currently emits only pass@N (best-of-N win rate) when num_repeats > 1. This PR adds pass@1[avg-of-N] (average single-shot win rate) so Arena-Hard runs surface both metrics, matching the convention of sister metric classes (OmniMetrics, MathMetrics) that emit both via BaseMetrics._compute_pass_at_k.

Resolves the TODO in arena_metrics.py:124 ("the class should support pass@k").

Why

Arena-Hard scoring is a global Bradley-Terry / Elo win rate via get_aggregate_score rather than a per-prediction binary correctness score, so ArenaMetrics cannot reuse BaseMetrics._compute_pass_at_k directly. The existing update() picked the best-of-N pair per prompt eagerly and discarded the other N-1 generations, which made pass@1[avg-of-N] impossible to compute downstream.
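For contrast, the per-prediction flavor of pass@k (shown here as the standard unbiased estimator; whether _compute_pass_at_k uses exactly this closed form is an assumption) needs, for every prompt, the count c of correct samples out of n generations:

$$\text{pass@}k \;=\; \mathbb{E}_{\text{prompts}}\!\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

Arena-Hard produces no per-prompt correctness count; the win rate only exists after a single global fit over all judgements, which is why both metrics in this PR are obtained by re-running that aggregation over different subsets of the stored predictions.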

How

  • update() now stores all per-prediction (judgement-gen-base, judgement-base-gen) score pairs in self.per_prompt_scores (no eager selection, no N==1 vs N>1 branching).
  • get_metrics() derives both metrics from the same stored data (see the sketch after this list):
    • pass@N (best-of-N): pick the most candidate-favorable label per direction across all N predictions for each prompt, then run Elo bootstrap on the resulting one-pair-per-prompt list. Identical numerical behavior to before.
    • pass@1[avg-of-N]: run Elo bootstrap separately on each repeat's predictions (N independent bootstraps over the same 500-prompt set), then average the resulting win rates. Skipped for N==1 (degenerate with pass@1).
  • Per-category breakdown (arena-hard-v2) computed for both metrics via the same _aggregate helper.
  • Numpy types cast to native Python before return so yaml.safe_dump works downstream.
  • evaluations_to_print overridden to drop the inherited majority@N request that Arena never emitted (matches OmniMetrics convention).
  • Drops dead state: self.lengths (never read by anything), self.agg_mode (no longer needed).
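
To make the flow above concrete, here is a minimal sketch (not the actual ArenaMetrics code). aggregate() and best_pair() stand in for the repo's get_aggregate_score Elo bootstrap and the new _best_pair helper; their real signatures and the "score" result key are assumptions made for illustration.

```python
# Minimal sketch of the new data flow, not the actual class.
from statistics import mean


def update(per_prompt_scores, categories, prediction_group):
    # Keep every (gen-vs-base, base-vs-gen) judgement pair for this prompt;
    # no eager best-of-N selection and no N==1 vs N>1 branching.
    per_prompt_scores.append(
        [(p["judgement-gen-base"], p["judgement-base-gen"]) for p in prediction_group]
    )
    categories.append(prediction_group[0].get("category"))


def get_metrics(per_prompt_scores, best_pair, aggregate):
    n = len(per_prompt_scores[0])  # number of repeats N
    metrics = {}
    # pass@N: one most-candidate-favorable pair per prompt, single aggregation.
    metrics[f"pass@{n}"] = aggregate([best_pair(pairs) for pairs in per_prompt_scores])
    if n > 1:
        # pass@1[avg-of-N]: aggregate each repeat's predictions independently,
        # then average the resulting win rates.
        per_repeat = [
            aggregate([pairs[i] for pairs in per_prompt_scores]) for i in range(n)
        ]
        metrics[f"pass@1[avg-of-{n}]"] = {
            "score": mean(r["score"] for r in per_repeat),
            "per_repeat_scores": [r["score"] for r in per_repeat],
        }
    return metrics
```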

Behavior preserved

  • pass@N numerical output unchanged for any input.
  • N==1 case (single prediction per prompt) still emits exactly pass@1 and nothing else.
  • Per-category breakdown semantics unchanged.
  • get_incorrect_sample, _get_judge_score, __init__ all unchanged.

Test plan

  • pytest tests/test_arena_metrics.py — 6/6 passing (5 existing tests + 1 new for num_repeats > 1).
  • yaml.safe_dump round-trip on the returned metrics dict (verifies the numpy-to-native casts).
  • Bit-for-bit match against real Arena-Hard-v2 production data: ran the patched class on output-rs*.jsonl from a 500-prompt × 5-repeat eval; produced pass@5 = 97.82 (matching the unchanged best-of-N path) and pass@1[avg-of-5] = 92.258 (matching an offline reference computation that uses the same np.random.seed(42) Elo bootstrap; per-repeat scores [91.99, 91.95, 93.82, 91.48, 92.05]).
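
As a quick arithmetic check of that averaging (using the per-repeat scores quoted above; rounding the result to three decimals is assumed):

```python
per_repeat = [91.99, 91.95, 93.82, 91.48, 92.05]
avg = sum(per_repeat) / len(per_repeat)
print(round(avg, 3))  # 92.258, matching the reported pass@1[avg-of-5]
```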

Summary by CodeRabbit

  • Bug Fixes

    • Improved metric computation for pass@N and pass@1 (avg-of-N), including correct per-repeat averaging and more accurate aggregation; outputs now serialize to native types.
  • Refactor

    • Metrics redesigned to store per-prompt score pairs and select best-direction pairs, enabling richer Arena-style metric derivations.
  • Tests

    • Expanded tests covering pass@N, pass@1[avg-of-N], repeat behavior, and invalid-score handling.
  • Breaking Changes

    • Removed get_incorrect_sample() from the public evaluation API; printed metrics now omit majority@k-like keys.



coderabbitai Bot commented May 6, 2026

📝 Walkthrough

ArenaMetrics was rewritten to store per-prompt (gen-base, base-gen) judgement pairs, select a best-per-direction label via a new private helper, and derive pass@N and pass@1[avg-of-N] from that stored data. Aggregation and numpy→native casting were added; get_incorrect_sample was removed.

Changes

ArenaMetrics Redesign

| Layer | File(s) | Summary |
|---|---|---|
| Imports & Constants | nemo_skills/evaluation/metrics/arena_metrics.py | Added mean and direction-preference tuples used to pick the most candidate-favorable label per judgement direction. |
| Data Shape / Storage | nemo_skills/evaluation/metrics/arena_metrics.py | update() now appends per-prompt lists of (gen-base, base-gen) score tuples to self.per_prompt_scores and records prompt categories in self.categories. reset() initializes these structures. |
| Best-Pair Selection | nemo_skills/evaluation/metrics/arena_metrics.py | New static helper _best_pair(prompt_pairs) chooses the preferred label per direction from an N-sized pool of prediction pairs (see the sketch after this table). |
| Metrics Derivation | nemo_skills/evaluation/metrics/arena_metrics.py | get_metrics() computes pass@N by applying _best_pair across prompts and _aggregate(). For N>1 it also computes pass@1[avg-of-N] by averaging N single-shot aggregations (per-repeat). |
| Aggregation & Casting | nemo_skills/evaluation/metrics/arena_metrics.py | _aggregate() invokes external get_aggregate_score, wraps results with _native_aggregate_score() to cast numpy types to native Python, and emits per-category sub-aggregates when categories exist. |
| Repeat Averaging | nemo_skills/evaluation/metrics/arena_metrics.py | Added _average_aggregations() to average per-repeat metric dicts into a single pass@1[avg-of-N] result, summing invalid counts and exposing per_repeat_scores. |
| API / Cleanup | nemo_skills/evaluation/metrics/arena_metrics.py | Removed public method get_incorrect_sample. evaluations_to_print() overridden to list Arena-specific keys when max_k>1. |
| Tests | tests/test_arena_metrics.py | Updated tests to assert on per_prompt_scores structure and added test_arena_metrics_pass_at_k_with_repeats validating pass@N, pass@1[avg-of-N], num_entries, per_repeat_scores length, and the relation between best-of-N and averaged scores. |
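
As a rough illustration of the best-per-direction selection described in the Best-Pair Selection row, a helper of roughly this shape would work; the actual label strings and preference tuples in arena_metrics.py are not shown on this page, so the gen_preference / base_preference arguments below are hypothetical placeholders.

```python
def best_pair(prompt_pairs, gen_preference, base_preference):
    """Pick, independently per judgement direction, the most candidate-favorable
    label seen across the N prediction pairs for one prompt (illustrative sketch)."""
    def best(labels, preference):
        # A lower index in the preference tuple means "more favorable to the candidate".
        return min(labels, key=preference.index)

    gen_base_labels = [pair[0] for pair in prompt_pairs]
    base_gen_labels = [pair[1] for pair in prompt_pairs]
    return best(gen_base_labels, gen_preference), best(base_gen_labels, base_preference)
```

With Arena-style judgement labels, each preference tuple would simply rank the labels from most to least favorable to the candidate model for its direction.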

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 61.54%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'Add pass@1[avg-of-N] support to ArenaMetrics' directly and specifically describes the main feature addition. It aligns with the primary objective of adding pass@1[avg-of-N] metric support, though it doesn't mention the equally important pass@N best-of-N metric that was refactored. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |





coderabbitai (bot) left a comment


Actionable comments posted: 1

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a1ade340-c011-4bea-87fb-1438315cd9c0

📥 Commits

Reviewing files that changed from the base of the PR and between 589294c and 9e1a154.

📒 Files selected for processing (2)
  • nemo_skills/evaluation/metrics/arena_metrics.py
  • tests/test_arena_metrics.py

Comment on lines +159 to +165
```python
    def evaluations_to_print(self):
        # Override BaseMetrics' default — Arena doesn't compute majority@k, so dropping
        # that key avoids a missing-key request to the framework's printer (matches the
        # OmniMetrics convention).
        if self.max_k > 1:
            return [f"pass@{self.max_k}", f"pass@1[avg-of-{self.max_k}]"]
        return ["pass@1"]
```
coderabbitai (bot) commented:

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

TypeError when max_k is None.

self.max_k > 1 will raise TypeError: '>' not supported between instances of 'NoneType' and 'int' when max_k is unset. This contrasts with get_metrics() which handles this with n = self.max_k or 1.

Proposed fix:

```diff
     def evaluations_to_print(self):
         # Override BaseMetrics' default — Arena doesn't compute majority@k, so dropping
         # that key avoids a missing-key request to the framework's printer (matches the
         # OmniMetrics convention).
-        if self.max_k > 1:
+        if self.max_k and self.max_k > 1:
             return [f"pass@{self.max_k}", f"pass@1[avg-of-{self.max_k}]"]
         return ["pass@1"]
```
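
An equivalent fix, noted in the comment above, is to mirror get_metrics() by normalizing the repeat count before comparing; roughly (a sketch, not the committed code):

```python
    def evaluations_to_print(self):
        # Mirror get_metrics(): treat an unset max_k as a single repeat.
        n = self.max_k or 1
        if n > 1:
            return [f"pass@{n}", f"pass@1[avg-of-{n}]"]
        return ["pass@1"]
```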

kslowik force-pushed the jslowikowski/arena-metrics-pass-at-1-avg-of-n branch from 9e1a154 to 00878f8 on May 6, 2026 at 00:13

Signed-off-by: Jakub Slowikowski <jslowikowski@nvidia.com>

kslowik force-pushed the jslowikowski/arena-metrics-pass-at-1-avg-of-n branch from 00878f8 to 2839e25 on May 10, 2026 at 19:20
coderabbitai (bot) left a comment


🧹 Nitpick comments (1)
tests/test_arena_metrics.py (1)

165-185: ⚡ Quick win

Add repeat-mode per-category assertions to cover the new category path.

This test currently uses one category only, so it won’t catch regressions in per-category emission for num_repeats > 1 (which is part of the PR behavior).

Suggested test extension:

```diff
-    for _ in range(n_prompts):
-        preds = [_make_prediction(*random.choice(scores_pool), category="test") for _ in range(n_repeats)]
+    for i in range(n_prompts):
+        category = "hard_prompt" if i % 2 == 0 else "creative_writing"
+        preds = [_make_prediction(*random.choice(scores_pool), category=category) for _ in range(n_repeats)]
         m.update(preds)
@@
     assert metrics[pass_at_n]["num_entries"] == n_prompts
     assert metrics[avg_of_n]["num_entries"] == n_prompts
+
+    for metric_name in (pass_at_n, avg_of_n):
+        assert "category_hard_prompt" in metrics[metric_name]
+        assert "category_creative_writing" in metrics[metric_name]
+        assert metrics[metric_name]["category_hard_prompt"]["num_entries"] == n_prompts // 2
+        assert metrics[metric_name]["category_creative_writing"]["num_entries"] == n_prompts // 2
```

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 12bb5654-3119-4378-a0de-297ba30f9c1d

📥 Commits

Reviewing files that changed from the base of the PR and between 00878f8 and 2839e25.

📒 Files selected for processing (2)
  • nemo_skills/evaluation/metrics/arena_metrics.py
  • tests/test_arena_metrics.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/evaluation/metrics/arena_metrics.py
