Skip to content

Add better handling of sft vs pretrain dataset selection#119

Open
RaphaelKreft wants to merge 1 commit into
multimodality/sft-bestfit-sep-classesfrom
multimodality/improve_dset_ident
Open

Add better handling of sft vs pretrain dataset selection#119
RaphaelKreft wants to merge 1 commit into
multimodality/sft-bestfit-sep-classesfrom
multimodality/improve_dset_ident

Conversation

@RaphaelKreft
Copy link
Copy Markdown

What changed

Old: a substring check ("apertus_sft" in dataset_path) decided whether a dataset was ApertusSFTDataset or GPTDataset. It ignored --ap-sft.

New: an explicit per-entry marker sft:<prefix> / pretrain:<prefix> (case-insensitive). Mixed SFT + pretrain blends are now more explicit.

--data-path 0.3 sft:/data/dolly 0.7 pretrain:/data/fineweb

Applies to every blend arg (--data-path, --{train,valid,test}-data-path, --data-args-path, --per-split-data-args-path).

Precedence (first match wins)

# Condition Dispatch Warning
1 Marker sft: / pretrain: on the path per marker
2 No marker, --ap-sft set ApertusSFTDataset
3 No marker, --ap-sft not set, path contains apertus_sft ApertusSFTDataset DeprecationWarning
4 Otherwise GPTDataset
Mock config / None path MockGPTDataset

Only the leading marker is stripped (sft:sft:/x → type sft, path sft:/x). Empty path after marker (sft:) raises AssertionError. Cache identity is keyed on the stripped path.

Behavior diff vs before

Scenario Old New
--ap-sft --data-path 1.0 /data/dolly GPTDataset (latent bug — flag was ignored) ApertusSFTDataset (rule 2)
Mixed sft: + pretrain: entries not supported works
--data-path 1.0 /data/apertus_sft_foo SFT, silent SFT + DeprecationWarning
SFT:/data/x (uppercase) treated as literal path ApertusSFTDataset
Pure pretrain / all-pretrain pretrain pretrain (unchanged)
All-SFT with apertus_sft in path SFT SFT (unchanged, now warns)

--ap-sft semantics

Still required for any SFT run. It now does two things:

  1. Gates the --calculate-per-token-loss assertion + SFT loss-reduction paths (unchanged).
  2. Auto-tags unmarked entries as SFT (new — rule 2 above).

Rule of thumb: all-SFT → set --ap-sft, no markers needed. Mixed → set --ap-sft + mark every entry explicitly. All-pretrain → neither.

Migration

  • Existing --ap-sft launchers: no change needed.
  • Paths with apertus_sft in the name: still work, but emit DeprecationWarning — migrate to sft: when convenient.
  • Pretrain datasets whose names happen to contain apertus_sft: prefix with pretrain: to opt out of the legacy fallback.

Implementation

Concern Location
Marker parser megatron/core/datasets/utils.pysplit_dataset_type_marker()
Builder dispatch megatron/core/datasets/blended_megatron_dataset_builder.py_resolve_dataset_class()
Config field megatron/core/datasets/gpt_dataset.pyGPTDatasetConfig.ap_sft_auto_tag
Flag wiring pretrain_gpt.py, initialize_sft_dataset.py
Argparse help megatron/training/arguments.py (--data-path, --ap-sft)
Helper scripts/tools/create_weighted_data_config.py --dataset-type {sft,pretrain,none}
Tests tests/unit_tests/data/test_dataset_type_marker.py

@RaphaelKreft
Copy link
Copy Markdown
Author

Would be interested about you opinion on this design. I think its quite clean.

If you agree I will test it on the cluster and merge

@dtamayo-nlp
Copy link
Copy Markdown
Collaborator

I agree, thanks for taking care of this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants