Commit 76c865c

feat(byob): add explicit few-shot dataset support (#993)
## Summary

Adds explicit BYOB few-shot controls for benchmarks where a split-only rewrite is not enough.

- Add `fewshot_dataset` to `@benchmark` so tasks can provide an exact few-shot dataset URI/path, including filters, configs, `data_files`, and other query params.
- Add `fewshot_prefix` to prepend static text before rendered few-shot examples.
- Fix `--num-fewshot 0` so it overrides non-zero benchmark defaults and enables true 0-shot validation.
- Save BYOB predictions by default from generated command templates.
- Expose `fewshot_dataset` and `fewshot_prefix` in generated FDF `config.params.extra.dataset`.
- Update docs and tests for precedence, fallback, prefix rendering, and explicit 0-shot behavior.

## Why

`fewshot_split` only works for simple datasets where changing `split` is enough. Some datasets also require `filter_field` / `filter_value`, `data_files`, configs, or other URI parameters. Reconstructing those generically is fragile and can produce mixed-language or wrong-source few-shot examples. This change lets benchmark authors provide the exact few-shot source when needed while preserving existing `fewshot_split` behavior.

## Test Plan

```bash
cd packages/nemo-evaluator
uv run python -m pytest \
  tests/unit_tests/byob/test_byob_decorators.py::TestBenchmarkLogprobFields \
  tests/unit_tests/byob/test_byob_eval_logic.py::TestFewshotPrefix \
  tests/unit_tests/byob/test_byob_eval_logic.py::TestBuildFewshotExamples \
  tests/unit_tests/byob/test_byob_compiler.py::TestBuildFdfHelper::test_fdf_groups_dataset_config_under_extra_dataset \
  tests/unit_tests/byob/test_byob_runner.py::TestFewshotOverride
```

Signed-off-by: kanishks <kanishks@nvidia.com>
1 parent 231526c commit 76c865c

10 files changed: 430 additions & 57 deletions

docs/libraries/nemo-evaluator/extending/byob/benchmark-decorator.md

Lines changed: 3 additions & 1 deletion
```diff
@@ -30,7 +30,9 @@ def check(sample: ScorerInput) -> dict:
 | `choices` | `list[str]` | `None` | Static candidate continuations for `endpoint_type="completions_logprob"` |
 | `choices_field` | `str` | `None` | Dataset field containing per-row candidate continuations for `endpoint_type="completions_logprob"`; dotted paths such as `choices.text` are supported |
 | `num_fewshot` | `int` | `0` | Number of few-shot examples to prepend to each prompt |
-| `fewshot_split` | `str` | `None` | Optional split to sample few-shot examples from |
+| `fewshot_dataset` | `str` | `None` | Optional explicit dataset URI/path to sample few-shot examples from. Use when the few-shot source needs filters, `data_files`, configs, or other URI options that cannot be expressed by a split name alone. Takes precedence over `fewshot_split`. |
+| `fewshot_split` | `str` | `None` | Optional split name to sample few-shot examples from when the primary `dataset` is an `hf://` URI. Used only if `fewshot_dataset` is not set or fails to load. |
+| `fewshot_prefix` | `str` | `""` | Optional static text prepended once before the rendered few-shot examples (e.g. `"The following are multiple-choice questions...\n\n"`). |
 | `fewshot_template` | `str` | `None` | Optional template for rendering few-shot examples |
 | `fewshot_separator` | `str` | `"\n\n"` | Separator between rendered few-shot examples |
 
```
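For orientation, a hedged sketch of how the three few-shot parameters from this table combine on one benchmark. The task name, URIs, and scorer body are illustrative placeholders, not from this commit; real scorers take a `ScorerInput` and return a `dict` per the decorator docs.

```python
from nemo_evaluator.contrib.byob.decorators import benchmark

@benchmark(
    name="example-task",                       # hypothetical task name
    dataset="hf://my-org/example?split=test",  # placeholder URI
    prompt="Question: {question}\nAnswer:",
    target_field="answer",
    num_fewshot=3,
    # Explicit few-shot source; takes precedence over fewshot_split.
    fewshot_dataset="hf://my-org/example?split=train",
    # Used only if fewshot_dataset is unset or fails to load.
    fewshot_split="train",
    # Emitted once, before the rendered examples.
    fewshot_prefix="Answer the following questions.\n\n",
)
def check(sample):
    # Scorer body elided in this sketch.
    ...
```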

docs/libraries/nemo-evaluator/extending/byob/datasets.md

Lines changed: 87 additions & 15 deletions
````diff
@@ -144,28 +144,31 @@ Otherwise the shell treats `&` as a background-command separator.
 
 ### `extra.dataset.*` namespace
 
-BYOB groups dataset-related configuration under
-`config.params.extra.dataset.*` in the FDF / run_config:
+BYOB exposes two dataset-related keys under `config.params.extra.dataset.*`
+that can be overridden at run time without rebuilding the benchmark:
 
-| Key | Description |
-|-----|-------------|
-| `path` | Dataset file path or `hf://` URI (compile-time default from `@benchmark(dataset=...)`). |
-| `num_fewshot` | Optional few-shot example count (lm-eval-harness parity). |
-| `field_mapping` | Informational mirror of `@benchmark(field_mapping=...)`. |
-| `choices` / `choices_field` | Informational mirror of `@benchmark(choices=...)` / `@benchmark(choices_field=...)`. |
+| Key | CLI flag | Description |
+|-----|----------|-------------|
+| `path` | `--dataset` | Dataset file path or `hf://` URI. Compile-time default from `@benchmark(dataset=...)`. |
+| `num_fewshot` | `--num-fewshot` | Few-shot example count (lm-eval-harness parity). Pass `0` to force true 0-shot for a benchmark that declares a non-zero default. |
+
+All other dataset-related options (`field_mapping`, `choices`, `choices_field`,
+`fewshot_dataset`, `fewshot_prefix`, `fewshot_split`, etc.) are baked into the
+benchmark at compile time from the `@benchmark(...)` decorator and are not
+runtime-overridable — change them in your benchmark module and recompile with
+`nemo-evaluator-byob compile`.
 
 ### Overriding the dataset at run time
 
-The `@benchmark` decorator's `dataset=` value is the compile-time default. To
-swap it for a single run without rebuilding the benchmark, set
-`config.params.extra.dataset.path` via the launcher's run_config or CLI. The
-launcher deep-merges via OmegaConf, so sibling keys under `extra.dataset`
-(`num_fewshot`, `field_mapping`, etc.) and under `extra` (`benchmark_module`,
-`requirements`, …) are preserved.
+To swap `path` or `num_fewshot` for a single run, set the corresponding key
+under `config.params.extra.dataset.*` via the launcher's run_config or CLI.
+The launcher deep-merges via OmegaConf, so sibling keys (and unrelated keys
+under `extra` such as `benchmark_module`, `requirements`, …) are preserved.
 
 ```bash
 nemo-evaluator-launcher run --config my_config.yaml \
-  -o 'evaluation.tasks.<task_name>.nemo_evaluator_config.config.params.extra.dataset.path=hf://other/foo?split=test'
+  -o 'evaluation.tasks.<task_name>.nemo_evaluator_config.config.params.extra.dataset.path=hf://other/foo?split=test' \
+  -o 'evaluation.tasks.<task_name>.nemo_evaluator_config.config.params.extra.dataset.num_fewshot=0'
 ```
 
 Or in a run_config YAML:
@@ -183,6 +186,75 @@ evaluation:
         num_fewshot: 5
 ```
 
+## Few-shot Examples
+
+BYOB resolves the few-shot example pool with this precedence:
+
+1. **`fewshot_dataset`** — explicit URI/path. Use this when the few-shot
+   source needs filters, `data_files`, configs, or any other URI options
+   that cannot be expressed by a split name (e.g.
+   `hf://my-org/foo?data_files=train.json&filter_field=lang&filter_value=hi`).
+2. **`fewshot_split`** — split name reused with the primary `hf://` dataset.
+   Used only when `fewshot_dataset` is unset *or* fails to load.
+3. **Tail of the primary dataset** — last-resort fallback. Logs a loud
+   warning because the few-shot pool overlaps with rows being evaluated,
+   risking gold-answer leakage into the prompt.
+
+### Examples
+
+Few-shot from a different split of the same HuggingFace dataset:
+
+```python
+@benchmark(
+    name="mmlu-mini",
+    dataset="hf://my-org/mmlu?split=test",
+    prompt="Question: {question}\nAnswer:",
+    target_field="answer",
+    num_fewshot=5,
+    fewshot_split="dev",
+)
+```
+
+Few-shot from a completely different dataset URI (filters, data_files, etc.):
+
+```python
+@benchmark(
+    name="boolq-hi",
+    dataset="hf://sarvamai/boolq-indic?split=validation&filter_field=language&filter_value=hi",
+    prompt="Passage: {passage}\nQuestion: {question}\nAnswer:",
+    target_field="answer",
+    num_fewshot=4,
+    fewshot_dataset="hf://sarvamai/boolq-indic?split=train&filter_field=language&filter_value=hi",
+)
+```
+
+Add a static introduction before the few-shot examples:
+
+```python
+@benchmark(
+    name="indommlu",
+    dataset="hf://indolem/IndoMMLU?split=test&trust_remote_code=true",
+    prompt="{question}\n\n{options}\n\nAnswer:",
+    target_field="answer",
+    num_fewshot=5,
+    fewshot_split="train",
+    fewshot_prefix="The following are multiple-choice questions. Choose the best answer.\n\n",
+)
+```
+
+The final prompt sent to the model is:
+
+```text
+<fewshot_prefix><example_1><fewshot_separator>...<example_N><fewshot_separator><test_prompt>
+```
+
+:::{tip}
+At run time you can force a true 0-shot evaluation against a benchmark
+that declares a non-zero `num_fewshot` by passing `--num-fewshot 0` on
+the `nemo-evaluator run_eval` CLI. The flag is `None` by default; an
+explicit `0` overrides the benchmark default.
+:::
+
 ## Field Mapping
 
 Use `field_mapping` to rename dataset columns so they match the `{placeholder}` names in your prompt template. The mapping is applied after loading the dataset and before prompt rendering.
````

packages/nemo-evaluator/src/nemo_evaluator/contrib/byob/compiler.py

Lines changed: 8 additions & 0 deletions
```diff
@@ -84,6 +84,10 @@
     " and config.params.extra.dataset.num_fewshot is not none %}"
     " --num-fewshot {{config.params.extra.dataset.num_fewshot}}"
     "{% endif %}"
+    "{% if config.params.extra.save_predictions is not defined"
+    " or config.params.extra.save_predictions %}"
+    " --save-predictions"
+    "{% endif %}"
 )
 
 
@@ -113,6 +117,10 @@ def _build_fdf(
         dataset_params["field_mapping"] = bench.field_mapping
     if bench.num_fewshot:
        dataset_params["num_fewshot"] = bench.num_fewshot
+    if bench.fewshot_dataset:
+        dataset_params["fewshot_dataset"] = bench.fewshot_dataset
+    if bench.fewshot_prefix:
+        dataset_params["fewshot_prefix"] = bench.fewshot_prefix
     # Multiple-choice loglikelihood metadata (informational; the runner
     # picks up choices/choices_field from the @benchmark registry itself).
     if bench.choices is not None:
```
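A minimal sketch of how the added conditional behaves when the template fragment is rendered on its own (assumes `jinja2` is installed; the surrounding command template is elided):

```python
from jinja2 import Template

# The exact fragment added above, rendered in isolation. Adjacent Python
# string literals concatenate into one Jinja template string.
fragment = Template(
    "{% if config.params.extra.save_predictions is not defined"
    " or config.params.extra.save_predictions %}"
    " --save-predictions"
    "{% endif %}"
)

# Key absent -> flag emitted, i.e. predictions are saved by default.
assert fragment.render(config={"params": {"extra": {}}}) == " --save-predictions"
# Explicit opt-out suppresses the flag.
assert fragment.render(config={"params": {"extra": {"save_predictions": False}}}) == ""
```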

packages/nemo-evaluator/src/nemo_evaluator/contrib/byob/decorators.py

Lines changed: 16 additions & 3 deletions
```diff
@@ -92,6 +92,8 @@ class BenchmarkDefinition:
     choices: Optional[List[str]] = None
     choices_field: Optional[str] = None
     num_fewshot: int = 0
+    fewshot_dataset: Optional[str] = None
+    fewshot_prefix: str = ""
     fewshot_split: Optional[str] = None
     fewshot_template: Optional[str] = None
     fewshot_separator: str = "\n\n"
@@ -170,6 +172,8 @@ def benchmark(
     choices: Optional[List[str]] = None,
     choices_field: Optional[str] = None,
     num_fewshot: int = 0,
+    fewshot_dataset: Optional[str] = None,
+    fewshot_prefix: str = "",
     fewshot_split: Optional[str] = None,
     fewshot_template: Optional[str] = None,
     fewshot_separator: str = "\n\n",
@@ -209,11 +213,18 @@
             when both are set on a per-row basis.
         num_fewshot: Number of few-shot examples to prepend to each
             prompt. Examples are sampled deterministically from
-            ``fewshot_split`` (or the first ``num_fewshot`` rows of the
-            evaluation dataset when ``fewshot_split`` is None).
+            ``fewshot_dataset`` if provided, then ``fewshot_split`` if
+            provided, otherwise from the evaluation dataset fallback pool.
+        fewshot_dataset: Optional explicit dataset path or URI to sample
+            few-shot examples from. Prefer this over ``fewshot_split`` when
+            the few-shot source requires filters, data files, configs, or
+            other URI options that cannot be represented by a split name.
+        fewshot_prefix: Optional static text prepended before rendered
+            few-shot examples. Useful for introducing or delimiting examples.
         fewshot_split: HuggingFace split name to sample few-shot examples
             from (e.g. ``"train"`` or ``"dev"``). Only meaningful when the
-            primary ``dataset`` is an ``hf://`` URI.
+            primary ``dataset`` is an ``hf://`` URI and ``fewshot_dataset``
+            is not set.
         fewshot_template: Optional template string used to render each
             few-shot example. ``None`` reuses the main ``prompt`` template
             and appends the rendered ``target_field`` value.
@@ -308,6 +319,8 @@ def decorator(fn):
         choices=list(choices) if choices is not None else None,
         choices_field=choices_field,
         num_fewshot=num_fewshot,
+        fewshot_dataset=fewshot_dataset,
+        fewshot_prefix=fewshot_prefix,
         fewshot_split=fewshot_split,
         fewshot_template=resolved_fewshot_template,
         fewshot_separator=fewshot_separator,
```

packages/nemo-evaluator/src/nemo_evaluator/contrib/byob/eval_logic.py

Lines changed: 38 additions & 27 deletions
```diff
@@ -589,8 +589,9 @@ def render_fewshot_example(bench: BenchmarkDefinition, row: Dict) -> Optional[st
 def build_fewshot_prefix(bench: BenchmarkDefinition, examples: List[Dict]) -> str:
     """Render *examples* into a prefix string ready to prepend to each prompt.
 
-    Skips examples that fail to render (missing fields). Always appends the
-    benchmark's ``fewshot_separator`` after the last example so the test
+    Skips examples that fail to render (missing fields). If configured,
+    ``bench.fewshot_prefix`` is prepended before the examples. Always appends
+    the benchmark's ``fewshot_separator`` after the last example so the test
     prompt starts on a fresh boundary.
     """
     if not examples:
@@ -602,45 +603,71 @@ def build_fewshot_prefix(bench: BenchmarkDefinition, examples: List[Dict]) -> st
         rendered.append(text)
     if not rendered:
         return ""
-    # Use ``is None`` rather than ``or`` so an explicit empty-string
-    # separator (concat with no delimiter) is honoured.
     sep = bench.fewshot_separator if bench.fewshot_separator is not None else "\n\n"
-    return sep.join(rendered) + sep
+    prefix = getattr(bench, "fewshot_prefix", "") or ""
+    return prefix + sep.join(rendered) + sep
 
 
 def build_fewshot_examples(
     primary_dataset_uri: str,
     primary_dataset: List[Dict],
     num_fewshot: int,
     fewshot_split: Optional[str],
+    fewshot_dataset: Optional[str] = None,
     field_mapping: Optional[Dict[str, str]] = None,
     seed: int = 42,
 ) -> List[Dict]:
     """Sample ``num_fewshot`` examples deterministically (lm-eval style).
 
     Selection rules (in order):
 
-    1. If ``fewshot_split`` is set and the primary dataset URI is an
+    1. If ``fewshot_dataset`` is set, load that exact dataset URI/path.
+       Use this when the safe few-shot source needs filters, data files,
+       configs, or other options that cannot be inferred from a split name.
+    2. If ``fewshot_split`` is set and the primary dataset URI is an
       ``hf://`` URI, load that split via the dataset module and sample
       ``num_fewshot`` rows. This is the **safe** path — examples come
       from a different split than the test set, so there is no
       contamination.
-    2. Otherwise, sample ``num_fewshot`` rows from the **tail** of
+    3. Otherwise, sample ``num_fewshot`` rows from the **tail** of
       ``primary_dataset`` (i.e. the rows least likely to be evaluated
       when ``--limit-samples`` is set). A loud warning is logged
       because the fewshot pool overlaps with the evaluation set when
       running the full dataset, which can leak gold answers into the
       prompt.
 
-    To guarantee no contamination, declare a ``fewshot_split`` on the
-    ``@benchmark`` (e.g. ``"train"`` or ``"dev"``) so this function
-    samples from a disjoint split.
+    To guarantee no contamination, declare a ``fewshot_dataset`` or
+    ``fewshot_split`` on the ``@benchmark`` so this function samples from a
+    disjoint source.
     """
     if num_fewshot <= 0:
         return []
 
     pool: List[Dict] = []
-    if fewshot_split and primary_dataset_uri.startswith("hf://"):
+    if fewshot_dataset:
+        try:
+            from nemo_evaluator.contrib.byob.dataset import load_dataset
+
+            pool = load_dataset(
+                fewshot_dataset,
+                limit=max(num_fewshot * 4, 16),
+                field_mapping=field_mapping,
+            )
+            if not pool:
+                logger.debug(
+                    "fewshot_dataset loaded successfully but returned 0 rows; "
+                    "falling back to fewshot_split or primary dataset",
+                    fewshot_dataset=fewshot_dataset,
+                )
+        except Exception as e:
+            logger.warning(
+                "Failed to load fewshot_dataset, falling back to fewshot_split or primary dataset",
+                fewshot_dataset=fewshot_dataset,
+                error=str(e),
+            )
+            pool = []
+
+    if not pool and fewshot_split and primary_dataset_uri.startswith("hf://"):
         try:
             from nemo_evaluator.contrib.byob.dataset import load_dataset
 
@@ -662,13 +689,6 @@ def build_fewshot_examples(
             pool = []
 
     if not pool:
-        # Fallback: no separate fewshot split is available. Sample from
-        # the tail of the primary dataset to minimise overlap with the
-        # eval set when the user passes --limit-samples (which iterates
-        # from the head). When the full dataset is evaluated, the
-        # fewshot pool is a strict subset of the eval set and gold
-        # answers can leak — warn loudly so the user knows to declare
-        # ``fewshot_split=`` on the @benchmark.
         logger.warning(
             "fewshot_split not available; sampling from primary dataset. "
             "This risks test-set contamination because the fewshot pool "
@@ -679,8 +699,6 @@
             primary_dataset_size=len(primary_dataset),
         )
         pool_size = max(num_fewshot * 4, num_fewshot)
-        # Tail slice — falls back to the head only if the dataset is
-        # smaller than the desired pool.
         if len(primary_dataset) > pool_size:
             pool = primary_dataset[-pool_size:]
         else:
@@ -741,11 +759,6 @@ def run_eval_loop(
         endpoint_type == "completions_logprob"
         or bench.endpoint_type == "completions_logprob"
     ):
-        # Logprob-mode MCQ ranking is the only strategy that requires
-        # ``choices`` / ``choices_field``; the @benchmark decorator
-        # already validates that pairing. Don't auto-pick MCQ just
-        # because choices are declared — a user may declare them as
-        # informational metadata while running the chat endpoint.
         strategy = MultipleChoiceStrategy()
     else:
         strategy = StandardStrategy()
@@ -816,8 +829,6 @@ def _run_eval_loop_sequential(
     progress_interval = max(1, min(10, total // 10)) if total > 0 else 1
 
     for idx, row in enumerate(dataset):
-        # Pass fewshot_prefix only when non-empty so legacy strategy
-        # implementations (without the kwarg) continue to work.
         kwargs = {"fewshot_prefix": fewshot_prefix} if fewshot_prefix else {}
         scores, prediction = strategy.evaluate_sample(
             idx,
```
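To make the new prefix assembly concrete, a self-contained sketch of the string `build_fewshot_prefix` now produces (the rendered examples below are made up; the real function renders them from dataset rows via `render_fewshot_example`):

```python
# Stand-ins for examples already rendered by render_fewshot_example().
rendered = ["Q: Is water wet?\nA: yes", "Q: Is fire cold?\nA: no"]
sep = "\n\n"                              # bench.fewshot_separator
fewshot_prefix = "Answer yes or no.\n\n"  # bench.fewshot_prefix

# Same expression as the new return statement: prefix, then the joined
# examples, then a trailing separator so the test prompt starts cleanly.
prompt_prefix = fewshot_prefix + sep.join(rendered) + sep
assert prompt_prefix == (
    "Answer yes or no.\n\n"
    "Q: Is water wet?\nA: yes\n\n"
    "Q: Is fire cold?\nA: no\n\n"
)
```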

packages/nemo-evaluator/src/nemo_evaluator/contrib/byob/runner.py

Lines changed: 12 additions & 10 deletions
```diff
@@ -747,10 +747,10 @@ def main():
     parser.add_argument(
         "--num-fewshot",
         type=int,
-        default=0,
+        default=None,
         help=(
             "Number of few-shot examples to prepend to each prompt "
-            "(default: 0). Examples are sampled deterministically from the "
+            "(default: benchmark default, usually 0). Examples are sampled deterministically from the "
             "benchmark's fewshot_split (or the first --num-fewshot rows of "
             "the same dataset when fewshot_split is not declared)."
         ),
@@ -781,15 +781,16 @@
         field_mapping=bench.field_mapping,
     )
 
-    # Resolve few-shot examples: precedence is CLI flag > benchmark default.
-    # Robust to mocked benchmark objects (tests use MagicMock) where
-    # ``bench.num_fewshot`` may not be a real int.
+    # Resolve few-shot examples: an explicit CLI value, including 0, must
+    # override the benchmark default. This is required for true 0-shot
+    # validation of benchmarks that declare a non-zero default.
     effective_num_fewshot = 0
-    try:
-        effective_num_fewshot = int(args.num_fewshot or 0)
-    except (TypeError, ValueError):
-        effective_num_fewshot = 0
-    if not effective_num_fewshot:
+    if args.num_fewshot is not None:
+        try:
+            effective_num_fewshot = int(args.num_fewshot)
+        except (TypeError, ValueError):
+            effective_num_fewshot = 0
+    else:
         try:
             effective_num_fewshot = int(getattr(bench, "num_fewshot", 0) or 0)
         except (TypeError, ValueError):
@@ -801,6 +802,7 @@
         primary_dataset=dataset,
         num_fewshot=effective_num_fewshot,
         fewshot_split=bench.fewshot_split,
+        fewshot_dataset=bench.fewshot_dataset,
         field_mapping=bench.field_mapping,
         seed=args.fewshot_seed,
     )
```
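The override semantics reduce to a small decision rule; a hedged stand-alone restatement (the helper below is hypothetical, not part of `runner.py`):

```python
from typing import Optional

def resolve_num_fewshot(cli_value: Optional[int], bench_default: int) -> int:
    # Mirrors the runner: an explicit CLI value, including 0, wins;
    # otherwise fall back to the @benchmark default.
    if cli_value is not None:
        return cli_value
    return bench_default

assert resolve_num_fewshot(None, 5) == 5  # no flag -> benchmark default
assert resolve_num_fewshot(0, 5) == 0     # --num-fewshot 0 -> true 0-shot
assert resolve_num_fewshot(3, 5) == 3     # explicit override
```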
