
Add pass@1[avg-of-N] support to ArenaMetrics #1429

Open
kslowik wants to merge 1 commit into main from jslowikowski/arena-metrics-pass-at-1-avg-of-n

Conversation


kslowik (Collaborator) commented May 6, 2026

Summary

ArenaMetrics currently emits only pass@N (best-of-N win rate) when num_repeats > 1. This PR adds pass@1[avg-of-N] (average single-shot win rate) so Arena-Hard runs surface both metrics, matching the convention of sister metric classes (OmniMetrics, MathMetrics) that emit both via BaseMetrics._compute_pass_at_k.

Resolves the TODO in arena_metrics.py:124 ("the class should support pass@k").

Why

Arena-Hard scoring is a global Bradley-Terry / Elo win rate via get_aggregate_score rather than a per-prediction binary correctness score, so ArenaMetrics cannot reuse BaseMetrics._compute_pass_at_k directly. The existing update() picked the best-of-N pair per prompt eagerly and discarded the other N-1 generations, which made pass@1[avg-of-N] impossible to compute downstream.
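For contrast, the per-prediction flavor of pass@k (shown here as the standard unbiased estimator; whether _compute_pass_at_k uses exactly this closed form is an assumption) needs, for every prompt, the count c of correct samples out of n generations:

$$\text{pass@}k \;=\; \mathbb{E}_{\text{prompts}}\!\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

Arena-Hard produces no per-prompt correctness count; the win rate only exists after a single global fit over all judgements, which is why both metrics in this PR are obtained by re-running that aggregation over different subsets of the stored predictions.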

How

  • update() now stores all per-prediction (judgement-gen-base, judgement-base-gen) score pairs in self.per_prompt_scores (no eager selection, no N==1 vs N>1 branching).
  • get_metrics() derives both metrics from the same stored data (see the sketch after this list):
    • pass@N (best-of-N): pick the most candidate-favorable label per direction across all N predictions for each prompt, then run Elo bootstrap on the resulting one-pair-per-prompt list. Identical numerical behavior to before.
    • pass@1[avg-of-N]: run Elo bootstrap separately on each repeat's predictions (N independent bootstraps over the same 500-prompt set), then average the resulting win rates. Skipped for N==1 (degenerate with pass@1).
  • Per-category breakdown (arena-hard-v2) computed for both metrics via the same _aggregate helper.
  • Numpy types cast to native Python before return so yaml.safe_dump works downstream.
  • evaluations_to_print overridden to drop the inherited majority@N request that Arena never emitted (matches OmniMetrics convention).
  • Drops dead state: self.lengths (never read by anything), self.agg_mode (no longer needed).
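
To make the flow above concrete, here is a minimal sketch (not the actual ArenaMetrics code). aggregate() and best_pair() stand in for the repo's get_aggregate_score Elo bootstrap and the new _best_pair helper; their real signatures and the "score" result key are assumptions made for illustration.

```python
# Minimal sketch of the new data flow, not the actual class.
from statistics import mean


def update(per_prompt_scores, categories, prediction_group):
    # Keep every (gen-vs-base, base-vs-gen) judgement pair for this prompt;
    # no eager best-of-N selection and no N==1 vs N>1 branching.
    per_prompt_scores.append(
        [(p["judgement-gen-base"], p["judgement-base-gen"]) for p in prediction_group]
    )
    categories.append(prediction_group[0].get("category"))


def get_metrics(per_prompt_scores, best_pair, aggregate):
    n = len(per_prompt_scores[0])  # number of repeats N
    metrics = {}
    # pass@N: one most-candidate-favorable pair per prompt, single aggregation.
    metrics[f"pass@{n}"] = aggregate([best_pair(pairs) for pairs in per_prompt_scores])
    if n > 1:
        # pass@1[avg-of-N]: aggregate each repeat's predictions independently,
        # then average the resulting win rates.
        per_repeat = [
            aggregate([pairs[i] for pairs in per_prompt_scores]) for i in range(n)
        ]
        metrics[f"pass@1[avg-of-{n}]"] = {
            "score": mean(r["score"] for r in per_repeat),
            "per_repeat_scores": [r["score"] for r in per_repeat],
        }
    return metrics
```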

Behavior preserved

  • pass@N numerical output unchanged for any input.
  • N==1 case (single prediction per prompt) still emits exactly pass@1 and nothing else.
  • Per-category breakdown semantics unchanged.
  • get_incorrect_sample, _get_judge_score, __init__ all unchanged.

Test plan

  • pytest tests/test_arena_metrics.py — 6/6 passing (5 existing tests + 1 new for num_repeats > 1).
  • yaml.safe_dump round-trip on the returned metrics dict (verifies the numpy-to-native casts).
  • Bit-for-bit match against real Arena-Hard-v2 production data: ran the patched class on output-rs*.jsonl from a 500-prompt × 5-repeat eval; produced pass@5 = 97.82 (matching the unchanged best-of-N path) and pass@1[avg-of-5] = 92.258 (matching an offline reference computation that uses the same np.random.seed(42) Elo bootstrap; per-repeat scores [91.99, 91.95, 93.82, 91.48, 92.05]).
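
As a quick arithmetic check of that averaging (using the per-repeat scores quoted above; rounding the result to three decimals is assumed):

```python
per_repeat = [91.99, 91.95, 93.82, 91.48, 92.05]
avg = sum(per_repeat) / len(per_repeat)
print(round(avg, 3))  # 92.258, matching the reported pass@1[avg-of-5]
```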

Summary by CodeRabbit

  • Bug Fixes

    • Improved metric computation for pass@N and pass@1 (avg-of-N), including correct per-repeat averaging and more accurate aggregation; outputs now serialize to native types.
  • Refactor

    • Metrics redesigned to store per-prompt score pairs and select best-direction pairs, enabling richer Arena-style metric derivations.
  • Tests

    • Expanded tests covering pass@N, pass@1[avg-of-N], repeat behavior, and invalid-score handling.
  • Breaking Changes

    • Removed get_incorrect_sample() from the public evaluation API; printed metrics now omit majority@k-like keys.



coderabbitai Bot commented May 6, 2026

📝 Walkthrough

ArenaMetrics was rewritten to store per-prompt (gen-base, base-gen) judgement pairs, select a best-per-direction label via a new private helper, and derive pass@N and pass@1[avg-of-N] from that stored data. Aggregation and numpy→native casting were added; get_incorrect_sample was removed.

Changes

ArenaMetrics Redesign

| Layer | File(s) | Summary |
|---|---|---|
| Imports & Constants | nemo_skills/evaluation/metrics/arena_metrics.py | Added mean and direction-preference tuples used to pick the most candidate-favorable label per judgement direction. |
| Data Shape / Storage | nemo_skills/evaluation/metrics/arena_metrics.py | update() now appends per-prompt lists of (gen-base, base-gen) score tuples to self.per_prompt_scores and records prompt categories in self.categories. reset() initializes these structures. |
| Best-Pair Selection | nemo_skills/evaluation/metrics/arena_metrics.py | New static helper _best_pair(prompt_pairs) chooses the preferred label per direction from an N-sized pool of prediction pairs (see the sketch after this table). |
| Metrics Derivation | nemo_skills/evaluation/metrics/arena_metrics.py | get_metrics() computes pass@N by applying _best_pair across prompts and _aggregate(). For N>1 it also computes pass@1[avg-of-N] by averaging N single-shot aggregations (per-repeat). |
| Aggregation & Casting | nemo_skills/evaluation/metrics/arena_metrics.py | _aggregate() invokes external get_aggregate_score, wraps results with _native_aggregate_score() to cast numpy types to native Python, and emits per-category sub-aggregates when categories exist. |
| Repeat Averaging | nemo_skills/evaluation/metrics/arena_metrics.py | Added _average_aggregations() to average per-repeat metric dicts into a single pass@1[avg-of-N] result, summing invalid counts and exposing per_repeat_scores. |
| API / Cleanup | nemo_skills/evaluation/metrics/arena_metrics.py | Removed public method get_incorrect_sample. evaluations_to_print() overridden to list Arena-specific keys when max_k>1. |
| Tests | tests/test_arena_metrics.py | Updated tests to assert on per_prompt_scores structure and added test_arena_metrics_pass_at_k_with_repeats validating pass@N, pass@1[avg-of-N], num_entries, per_repeat_scores length, and the relation between best-of-N and averaged scores. |
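
As a rough illustration of the best-per-direction selection described in the Best-Pair Selection row, a helper of roughly this shape would work; the actual label strings and preference tuples in arena_metrics.py are not shown on this page, so the gen_preference / base_preference arguments below are hypothetical placeholders.

```python
def best_pair(prompt_pairs, gen_preference, base_preference):
    """Pick, independently per judgement direction, the most candidate-favorable
    label seen across the N prediction pairs for one prompt (illustrative sketch)."""
    def best(labels, preference):
        # A lower index in the preference tuple means "more favorable to the candidate".
        return min(labels, key=preference.index)

    gen_base_labels = [pair[0] for pair in prompt_pairs]
    base_gen_labels = [pair[1] for pair in prompt_pairs]
    return best(gen_base_labels, gen_preference), best(base_gen_labels, base_preference)
```

With Arena-style judgement labels, each preference tuple would simply rank the labels from most to least favorable to the candidate model for its direction.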

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 61.54%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'Add pass@1[avg-of-N] support to ArenaMetrics' directly and specifically describes the main feature addition. It aligns with the primary objective of adding pass@1[avg-of-N] metric support, though it doesn't mention the equally important pass@N best-of-N metric that was refactored. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |





coderabbitai (bot) left a comment


Actionable comments posted: 1

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a1ade340-c011-4bea-87fb-1438315cd9c0

📥 Commits

Reviewing files that changed from the base of the PR and between 589294c and 9e1a154.

📒 Files selected for processing (2)
  • nemo_skills/evaluation/metrics/arena_metrics.py
  • tests/test_arena_metrics.py

Comment on lines +159 to +165
```python
    def evaluations_to_print(self):
        # Override BaseMetrics' default — Arena doesn't compute majority@k, so dropping
        # that key avoids a missing-key request to the framework's printer (matches the
        # OmniMetrics convention).
        if self.max_k > 1:
            return [f"pass@{self.max_k}", f"pass@1[avg-of-{self.max_k}]"]
        return ["pass@1"]
```
coderabbitai (bot) commented:

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

TypeError when max_k is None.

self.max_k > 1 will raise TypeError: '>' not supported between instances of 'NoneType' and 'int' when max_k is unset. This contrasts with get_metrics() which handles this with n = self.max_k or 1.

Proposed fix:

```diff
     def evaluations_to_print(self):
         # Override BaseMetrics' default — Arena doesn't compute majority@k, so dropping
         # that key avoids a missing-key request to the framework's printer (matches the
         # OmniMetrics convention).
-        if self.max_k > 1:
+        if self.max_k and self.max_k > 1:
             return [f"pass@{self.max_k}", f"pass@1[avg-of-{self.max_k}]"]
         return ["pass@1"]
```
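
An equivalent fix, noted in the comment above, is to mirror get_metrics() by normalizing the repeat count before comparing; roughly (a sketch, not the committed code):

```python
    def evaluations_to_print(self):
        # Mirror get_metrics(): treat an unset max_k as a single repeat.
        n = self.max_k or 1
        if n > 1:
            return [f"pass@{n}", f"pass@1[avg-of-{n}]"]
        return ["pass@1"]
```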

kslowik force-pushed the jslowikowski/arena-metrics-pass-at-1-avg-of-n branch from 9e1a154 to 00878f8 on May 6, 2026 at 00:13

Signed-off-by: Jakub Slowikowski <jslowikowski@nvidia.com>

kslowik force-pushed the jslowikowski/arena-metrics-pass-at-1-avg-of-n branch from 00878f8 to 2839e25 on May 10, 2026 at 19:20
coderabbitai (bot) left a comment


🧹 Nitpick comments (1)
tests/test_arena_metrics.py (1)

165-185: ⚡ Quick win

Add repeat-mode per-category assertions to cover the new category path.

This test currently uses one category only, so it won’t catch regressions in per-category emission for num_repeats > 1 (which is part of the PR behavior).

Suggested test extension:

```diff
-    for _ in range(n_prompts):
-        preds = [_make_prediction(*random.choice(scores_pool), category="test") for _ in range(n_repeats)]
+    for i in range(n_prompts):
+        category = "hard_prompt" if i % 2 == 0 else "creative_writing"
+        preds = [_make_prediction(*random.choice(scores_pool), category=category) for _ in range(n_repeats)]
         m.update(preds)
@@
     assert metrics[pass_at_n]["num_entries"] == n_prompts
     assert metrics[avg_of_n]["num_entries"] == n_prompts
+
+    for metric_name in (pass_at_n, avg_of_n):
+        assert "category_hard_prompt" in metrics[metric_name]
+        assert "category_creative_writing" in metrics[metric_name]
+        assert metrics[metric_name]["category_hard_prompt"]["num_entries"] == n_prompts // 2
+        assert metrics[metric_name]["category_creative_writing"]["num_entries"] == n_prompts // 2
```

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 12bb5654-3119-4378-a0de-297ba30f9c1d

📥 Commits

Reviewing files that changed from the base of the PR and between 00878f8 and 2839e25.

📒 Files selected for processing (2)
  • nemo_skills/evaluation/metrics/arena_metrics.py
  • tests/test_arena_metrics.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/evaluation/metrics/arena_metrics.py
