Add better handling of sft vs pretrain dataset selection by RaphaelKreft · Pull Request #119 · swiss-ai/Megatron-LM

RaphaelKreft · 2026-05-20T11:10:28Z

What changed

Old: a substring check ("apertus_sft" in dataset_path) decided whether a dataset was ApertusSFTDataset or GPTDataset. It ignored --ap-sft.

New: an explicit per-entry marker sft:<prefix> / pretrain:<prefix> (case-insensitive). Mixed SFT + pretrain blends are now supported, with each entry's type stated explicitly.

--data-path 0.3 sft:/data/dolly 0.7 pretrain:/data/fineweb

Applies to every blend arg (--data-path, --{train,valid,test}-data-path, --data-args-path, --per-split-data-args-path).

Precedence (first match wins)

| # | Condition | Dispatch | Warning |
|---|---|---|---|
| 1 | Marker `sft:` / `pretrain:` on the path | per marker | none |
| 2 | No marker, `--ap-sft` set | `ApertusSFTDataset` | none |
| 3 | No marker, `--ap-sft` not set, path contains `apertus_sft` | `ApertusSFTDataset` | `DeprecationWarning` |
| 4 | Otherwise | `GPTDataset` | none |
| n/a | Mock config / `None` path | `MockGPTDataset` | none |
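The precedence table can be read as a short first-match-wins function. The sketch below is an illustration only, not the code in `_resolve_dataset_class()`: the function name, its signature, the warning text, and the choice to return class names as strings are all assumptions made for this example. It takes the marker as already parsed (`"sft"`, `"pretrain"`, or `None`).

```python
import warnings
from typing import Optional


def resolve_dataset_class_name(
    marker: Optional[str], path: Optional[str], ap_sft: bool
) -> str:
    """Return the dataset class name for one blend entry (hypothetical helper)."""
    # Mock config / no path: listed outside the numbered rules in the table.
    if path is None:
        return "MockGPTDataset"
    # Rule 1: an explicit marker always wins.
    if marker == "sft":
        return "ApertusSFTDataset"
    if marker == "pretrain":
        return "GPTDataset"
    # Rule 2: unmarked entries are auto-tagged as SFT when --ap-sft is set.
    if ap_sft:
        return "ApertusSFTDataset"
    # Rule 3: legacy substring fallback, now with a deprecation warning.
    if "apertus_sft" in path:
        warnings.warn(
            "Selecting SFT via 'apertus_sft' in the path is deprecated; use 'sft:'.",
            DeprecationWarning,
            stacklevel=2,
        )
        return "ApertusSFTDataset"
    # Rule 4: everything else is a pretraining dataset.
    return "GPTDataset"
```

Because rule 1 is checked first, `resolve_dataset_class_name("pretrain", "/data/apertus_sft_x", ap_sft=True)` returns `"GPTDataset"`: the explicit marker overrides both the flag and the legacy substring.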

Only the leading marker is stripped (sft:sft:/x → type sft, path sft:/x). Empty path after marker (sft:) raises AssertionError. Cache identity is keyed on the stripped path.
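A minimal sketch of the stripping rules just described, assuming the marker is everything before the first colon. The name `split_marker_sketch` and its return shape are hypothetical; the actual `split_dataset_type_marker()` may differ in signature and error message.

```python
from typing import Optional, Tuple

_MARKERS = ("sft", "pretrain")


def split_marker_sketch(entry: str) -> Tuple[Optional[str], str]:
    """Split 'sft:<prefix>' / 'pretrain:<prefix>' into (type, prefix)."""
    # partition() splits on the first colon only, so nested markers survive.
    head, sep, rest = entry.partition(":")
    if sep and head.lower() in _MARKERS:
        # An empty path after the marker is rejected, as described above.
        assert rest, f"empty path after marker in {entry!r}"
        return head.lower(), rest
    # No recognised marker: the entry is returned unchanged.
    return None, entry


print(split_marker_sketch("sft:sft:/x"))     # ('sft', 'sft:/x')
print(split_marker_sketch("SFT:/data/x"))    # ('sft', '/data/x')
print(split_marker_sketch("/data/fineweb"))  # (None, '/data/fineweb')
```

Returning the stripped path separately is what lets a cache be keyed on the path alone, independent of how the entry was marked.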

Behavior diff vs before

| Scenario | Old | New |
|---|---|---|
| `--ap-sft --data-path 1.0 /data/dolly` | `GPTDataset` (latent bug: flag was ignored) | `ApertusSFTDataset` (rule 2) |
| Mixed `sft:` + `pretrain:` entries | not supported | works |
| `--data-path 1.0 /data/apertus_sft_foo` | SFT, silent | SFT + `DeprecationWarning` |
| `SFT:/data/x` (uppercase) | treated as literal path | `ApertusSFTDataset` |
| Pure pretrain / all-pretrain | pretrain | pretrain (unchanged) |
| All-SFT with `apertus_sft` in path | SFT | SFT (unchanged, now warns) |

`--ap-sft` semantics

Still required for any SFT run. It now does two things:

1. Gates the `--calculate-per-token-loss` assertion and the SFT loss-reduction paths (unchanged).
2. Auto-tags unmarked entries as SFT (new; rule 2 above).

Rule of thumb: all-SFT → set --ap-sft, no markers needed. Mixed → set --ap-sft + mark every entry explicitly. All-pretrain → neither.
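The rule of thumb can be turned into a small pre-launch check. This is a hypothetical helper, not part of the PR: `check_blend` and its messages are invented for illustration, and it looks only at the path tokens of a blend (weights omitted).

```python
from typing import List


def check_blend(paths: List[str], ap_sft: bool) -> List[str]:
    """Return human-readable problems for a list of blend paths."""

    def marker(p: str) -> str:
        # "" means the entry carries no recognised marker.
        head, sep, _ = p.partition(":")
        return head.lower() if sep and head.lower() in ("sft", "pretrain") else ""

    kinds = [marker(p) for p in paths]
    problems = []
    # --ap-sft is still required for any SFT run.
    if "sft" in kinds and not ap_sft:
        problems.append("sft: entries present but --ap-sft is not set")
    # With --ap-sft set, unmarked entries become SFT (rule 2), which is
    # easy to overlook in a blend that also contains pretrain: entries.
    if ap_sft and "pretrain" in kinds and "" in kinds:
        problems.append(
            "mixed blend has unmarked entries; they will be auto-tagged as SFT"
        )
    return problems
```

For example, `check_blend(["/data/dolly", "pretrain:/data/fineweb"], ap_sft=True)` reports one problem, because `/data/dolly` would be treated as SFT without saying so explicitly.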

Migration

- Existing `--ap-sft` launchers: no change needed.
- Paths with `apertus_sft` in the name: still work, but emit a `DeprecationWarning`; migrate to `sft:` when convenient.
- Pretrain datasets whose names happen to contain `apertus_sft`: prefix with `pretrain:` to opt out of the legacy fallback.
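For launchers with many legacy paths, the migration could be scripted along these lines. The helper below is an assumption-laden sketch: `add_explicit_markers` and `pretrain_overrides` are hypothetical names, and the caller must list the pretrain datasets whose names merely happen to contain `apertus_sft`, since that cannot be inferred from the path.

```python
from typing import Collection, Iterable, List


def add_explicit_markers(
    paths: Iterable[str], pretrain_overrides: Collection[str] = ()
) -> List[str]:
    """Rewrite legacy blend paths to use explicit sft:/pretrain: markers."""
    out = []
    for p in paths:
        head, sep, _ = p.partition(":")
        if sep and head.lower() in ("sft", "pretrain"):
            out.append(p)  # already marked: leave untouched
        elif p in pretrain_overrides:
            out.append("pretrain:" + p)  # opt out of the legacy fallback
        elif "apertus_sft" in p:
            out.append("sft:" + p)  # legacy SFT path: make the type explicit
        else:
            out.append(p)  # plain path: unchanged
    return out


legacy = ["/data/apertus_sft_foo", "/data/fineweb", "/data/apertus_sft_crawl"]
print(add_explicit_markers(legacy, pretrain_overrides={"/data/apertus_sft_crawl"}))
# ['sft:/data/apertus_sft_foo', '/data/fineweb', 'pretrain:/data/apertus_sft_crawl']
```

Plain paths are left unmarked on purpose: whether they resolve to SFT or pretrain still depends on `--ap-sft`, so marking them is a decision for the launcher's author.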

Implementation

| Concern | Location |
|---|---|
| Marker parser | `megatron/core/datasets/utils.py` (`split_dataset_type_marker()`) |
| Builder dispatch | `megatron/core/datasets/blended_megatron_dataset_builder.py` (`_resolve_dataset_class()`) |
| Config field | `megatron/core/datasets/gpt_dataset.py` (`GPTDatasetConfig.ap_sft_auto_tag`) |
| Flag wiring | `pretrain_gpt.py`, `initialize_sft_dataset.py` |
| Argparse help | `megatron/training/arguments.py` (`--data-path`, `--ap-sft`) |
| Helper | `scripts/tools/create_weighted_data_config.py --dataset-type {sft,pretrain,none}` |
| Tests | `tests/unit_tests/data/test_dataset_type_marker.py` |

RaphaelKreft · 2026-05-20T11:12:11Z

I would be interested in your opinion on this design. I think it's quite clean.

If you agree, I will test it on the cluster and merge.

dtamayo-nlp · 2026-05-20T15:19:45Z

I agree, thanks for taking care of this!

Add better handling of sft vs pretrain dataset selection

e829a60

RaphaelKreft requested review from Alvorecer721 and dtamayo-nlp May 20, 2026 11:11

RaphaelKreft wants to merge 1 commit into multimodality/sft-bestfit-sep-classes from multimodality/improve_dset_ident
