
Add Google FACTS Grounding benchmark #1407

Draft

MahanFathi wants to merge 9 commits into main from mfathi/google-facts-grounding

Conversation

@MahanFathi
Collaborator

Summary

Adds support for the Google FACTS Grounding benchmark in NeMo-Skills.

This includes:

  • public-split data preparation from google/FACTS-grounding-public
  • generation prompt configuration for FACTS Grounding tasks
  • a 3-judge FACTS Grounding evaluator using Gemini, GPT-5.2, and Claude judge models
  • paper-aligned eligibility handling where a response is disqualified only when all eligibility judges mark it ineligible (see the sketch after this list)
  • FACTS-specific metrics, including final_factuality, unadjusted_factuality, eligibility rate, confidence intervals, per-judge scores, and sentence-label statistics
  • bounded datapoint-level judge scheduling so multi-call judge rows complete reliably under lower concurrency
  • per-judge API failure handling so one provider error does not abort the full evaluation
  • docs for running and reporting the benchmark
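
A minimal sketch of the consensus-ineligibility rule, using hypothetical names (``JudgeVerdict``, ``aggregate_response``) rather than the actual NeMo-Skills API:

```python
# Hypothetical sketch of the paper-aligned eligibility rule; names are
# illustrative and not the NeMo-Skills implementation.
from dataclasses import dataclass


@dataclass
class JudgeVerdict:
    grounding_score: float  # fraction of supported sentences, per this judge
    eligible: bool          # did this judge consider the response on-task?


def aggregate_response(verdicts: list[JudgeVerdict]) -> tuple[float, float]:
    """Return (unadjusted, final) factuality for one response.

    Assumes at least one verdict. A response is disqualified only when
    *all* judges mark it ineligible; a single eligible verdict keeps its
    grounding score in play.
    """
    unadjusted = sum(v.grounding_score for v in verdicts) / len(verdicts)
    final = unadjusted if any(v.eligible for v in verdicts) else 0.0
    return unadjusted, final
```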

Validation

  • pre-commit run --files nemo_skills/dataset/facts_grounding/__init__.py nemo_skills/dataset/facts_grounding/prepare.py nemo_skills/evaluation/metrics/facts_grounding_metrics.py nemo_skills/evaluation/metrics/map_metrics.py nemo_skills/inference/eval/facts_grounding_judge.py nemo_skills/prompt/config/generic/facts_grounding.yaml nemo_skills/prompt/config/judge/facts_grounding.yaml tests/test_facts_grounding_metrics.py
  • pre-commit run --files docs/evaluation/other-benchmarks.md
  • pytest tests/test_facts_grounding_metrics.py -q
  • python -m py_compile nemo_skills/inference/eval/facts_grounding_judge.py

Notes

The local public-split run for Nemotron-3-Nano produced final_factuality = 39.81% using the NeMo-Skills default judge set. This is a public-split result and is not directly comparable to the Kaggle private leaderboard score.

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Mirrors the google-facts reference ``compute_scores``: per-judge grounding/quality
verdicts are fanned out across Gemini 3.1 Pro / GPT-5.2 / Claude Opus 4.5 (all
via inference-api.nvidia.com), then aggregated into unadjusted_factuality,
final_factuality (consensus-ineligible responses zeroed), eligibility_rate,
per-judge slices, Wilson 95% CIs, and sentence-level label micro-averages.
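
For reference, a self-contained sketch of the Wilson score interval behind the 95% CIs; the actual code lives in facts_grounding_metrics.py and may differ in detail:

```python
import math


def wilson_ci(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% at z=1.96)."""
    if total == 0:
        return (0.0, 0.0)
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (center - margin, center + margin)
```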

Handles per-endpoint quirks: drops ``temperature`` for GPT-5/o-series judges
and ``top_p`` for non-Gemini judges (Bedrock rejects both). The single-judge
fallback is preserved when ``judge_models`` is empty.
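
A sketch of that parameter sanitization, assuming a hypothetical ``sanitize_sampling_params`` helper and illustrative model-name prefixes; the real logic is in facts_grounding_judge.py:

```python
def sanitize_sampling_params(model: str, params: dict) -> dict:
    """Drop sampling params that specific judge endpoints reject.

    Hypothetical helper mirroring the behavior described above; the
    prefix matching is illustrative only.
    """
    params = dict(params)
    if model.startswith(("gpt-5", "o1", "o3", "o4")):
        params.pop("temperature", None)  # GPT-5/o-series reject temperature
    if "gemini" not in model:
        params.pop("top_p", None)  # e.g. Bedrock-hosted judges reject top_p
    return params
```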

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
When judges emit a single JSON array ``[{...}, {...}]`` instead of the
newline-delimited object form, ``parse_grounding_json`` crashed with
``AttributeError: 'list' object has no attribute 'get'``. Flatten either
shape into a list of dicts; skip non-dict entries defensively.

This surfaced on the Nemotron v2 run (parse_reasoning=True → shorter,
cleaner final answers → judges more likely to return a compact
single-array response), crashing the whole judge stage after 45s.
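
A sketch of the fix, accepting both response shapes; this is a minimal reconstruction of the behavior described above, not the verbatim NeMo-Skills code:

```python
import json


def parse_grounding_json(raw: str) -> list[dict]:
    """Flatten a judge response into a list of dicts.

    Accepts either a single JSON array of objects or newline-delimited
    JSON objects; non-dict entries are skipped defensively.
    """
    raw = raw.strip()
    if raw.startswith("["):
        items = json.loads(raw)  # compact single-array form
    else:
        items = [
            json.loads(line)     # newline-delimited object form
            for line in raw.splitlines()
            if line.strip()
        ]
    return [item for item in items if isinstance(item, dict)]
```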

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
