
Add Google FACTS Grounding benchmark #1407

Draft

MahanFathi wants to merge 9 commits into main from mfathi/google-facts-grounding

Conversation

@MahanFathi
Collaborator

Summary

Adds support for the Google FACTS Grounding benchmark in NeMo-Skills.

This includes:

  • public-split data preparation from google/FACTS-grounding-public
  • generation prompt configuration for FACTS Grounding tasks
  • a 3-judge FACTS Grounding evaluator using Gemini, GPT-5.2, and Claude judge models
  • paper-aligned eligibility handling where a response is disqualified only when all eligibility judges mark it ineligible (see the sketch after this list)
  • FACTS-specific metrics, including final_factuality, unadjusted_factuality, eligibility rate, confidence intervals, per-judge scores, and sentence-label statistics
  • bounded datapoint-level judge scheduling so multi-call judge rows complete reliably under lower concurrency
  • per-judge API failure handling so one provider error does not abort the full evaluation
  • docs for running and reporting the benchmark
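
A minimal sketch of the consensus-ineligibility rule, using hypothetical names (``JudgeVerdict``, ``aggregate_response``) rather than the actual NeMo-Skills API:

```python
# Hypothetical sketch of the paper-aligned eligibility rule; names are
# illustrative and not the NeMo-Skills implementation.
from dataclasses import dataclass


@dataclass
class JudgeVerdict:
    grounding_score: float  # fraction of supported sentences, per this judge
    eligible: bool          # did this judge consider the response on-task?


def aggregate_response(verdicts: list[JudgeVerdict]) -> tuple[float, float]:
    """Return (unadjusted, final) factuality for one response.

    Assumes at least one verdict. A response is disqualified only when
    *all* judges mark it ineligible; a single eligible verdict keeps its
    grounding score in play.
    """
    unadjusted = sum(v.grounding_score for v in verdicts) / len(verdicts)
    final = unadjusted if any(v.eligible for v in verdicts) else 0.0
    return unadjusted, final
```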

Validation

  • pre-commit run --files nemo_skills/dataset/facts_grounding/__init__.py nemo_skills/dataset/facts_grounding/prepare.py nemo_skills/evaluation/metrics/facts_grounding_metrics.py nemo_skills/evaluation/metrics/map_metrics.py nemo_skills/inference/eval/facts_grounding_judge.py nemo_skills/prompt/config/generic/facts_grounding.yaml nemo_skills/prompt/config/judge/facts_grounding.yaml tests/test_facts_grounding_metrics.py
  • pre-commit run --files docs/evaluation/other-benchmarks.md
  • pytest tests/test_facts_grounding_metrics.py -q
  • python -m py_compile nemo_skills/inference/eval/facts_grounding_judge.py

Notes

The local public-split run for Nemotron-3-Nano produced final_factuality = 39.81% using the NeMo-Skills default judge set. This is a public-split result and is not directly comparable to the Kaggle private leaderboard score.

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Mirrors the google-facts reference ``compute_scores``: per-judge grounding/quality
verdicts are fanned out across Gemini 3.1 Pro / GPT-5.2 / Claude Opus 4.5 (all
via inference-api.nvidia.com), then aggregated into unadjusted_factuality,
final_factuality (consensus-ineligible responses zeroed), eligibility_rate,
per-judge slices, Wilson 95% CIs, and sentence-level label micro-averages.
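
For reference, a self-contained sketch of the Wilson score interval behind the 95% CIs; the actual code lives in facts_grounding_metrics.py and may differ in detail:

```python
import math


def wilson_ci(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% at z=1.96)."""
    if total == 0:
        return (0.0, 0.0)
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (center - margin, center + margin)
```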

Handles per-endpoint quirks: drops ``temperature`` for GPT-5/o-series judges
and ``top_p`` for non-Gemini judges (Bedrock rejects both). The single-judge
fallback is preserved when ``judge_models`` is empty.
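
A sketch of that parameter sanitization, assuming a hypothetical ``sanitize_sampling_params`` helper and illustrative model-name prefixes; the real logic is in facts_grounding_judge.py:

```python
def sanitize_sampling_params(model: str, params: dict) -> dict:
    """Drop sampling params that specific judge endpoints reject.

    Hypothetical helper mirroring the behavior described above; the
    prefix matching is illustrative only.
    """
    params = dict(params)
    if model.startswith(("gpt-5", "o1", "o3", "o4")):
        params.pop("temperature", None)  # GPT-5/o-series reject temperature
    if "gemini" not in model:
        params.pop("top_p", None)  # e.g. Bedrock-hosted judges reject top_p
    return params
```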

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
When judges emit a single JSON array ``[{...}, {...}]`` instead of the
newline-delimited object form, ``parse_grounding_json`` crashed with
``AttributeError: 'list' object has no attribute 'get'``. Flatten either
shape into a list of dicts; skip non-dict entries defensively.

This surfaced on the Nemotron v2 run (parse_reasoning=True → shorter,
cleaner final answers → judges more likely to return a compact
single-array response), crashing the whole judge stage after 45s.
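
A sketch of the fix, accepting both response shapes; this is a minimal reconstruction of the behavior described above, not the verbatim NeMo-Skills code:

```python
import json


def parse_grounding_json(raw: str) -> list[dict]:
    """Flatten a judge response into a list of dicts.

    Accepts either a single JSON array of objects or newline-delimited
    JSON objects; non-dict entries are skipped defensively.
    """
    raw = raw.strip()
    if raw.startswith("["):
        items = json.loads(raw)  # compact single-array form
    else:
        items = [
            json.loads(line)     # newline-delimited object form
            for line in raw.splitlines()
            if line.strip()
        ]
    return [item for item in items if isinstance(item, dict)]
```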

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
