Multimodality/sft bestfit sep classes by dtamayo-nlp · Pull Request #113 · swiss-ai/Megatron-LM

dtamayo-nlp · 2026-04-24T16:10:25Z

Motivation

The previous implementation worked correctly for SFT, but did not scale to pre-training workloads and eventually led to NCCL timeouts in tests with 4,000 files and billions of tokens.
This PR corrects issues raised in an older PR.

Features Introduced

The main bottleneck was the packing stage:

bisect.insort() incurs O(N) insertion cost, making the overall complexity O(N log N + N×B), which degrades to ~O(N^2) for large corpora (N = number of documents, B = number of bins). Replacing it with SortedList reduced insertion to O(log N) and improved the complexity to O(N log N), but Python overhead remained a bottleneck at scale.
To fully address this, the default implementation was changed to C++ (the Python implementation is still available).
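As a rough illustration of the packing step discussed above, here is a minimal best-fit-decreasing sketch in pure Python. It is not the code in this PR: the function name `pack_bfd` and its signature are hypothetical, and it uses only the standard library's `bisect`, so it corresponds to the slower variant whose list insertions and removals cost O(N).

```python
import bisect


def pack_bfd(doc_lengths, seq_len):
    """Best-fit decreasing: return a list of bins, each a list of document indices."""
    # Longest documents first; ties keep their original order (sorted is stable).
    order = sorted(range(len(doc_lengths)), key=lambda i: -doc_lengths[i])
    bins = []  # bins[b] = indices of the documents packed into bin b
    free = []  # sorted list of (remaining_capacity, bin_id) for bins with space left
    for i in order:
        length = doc_lengths[i]
        if length > seq_len:
            raise ValueError("chunk or truncate documents longer than seq_len first")
        # Tightest fit: first entry whose remaining capacity is >= length.
        pos = bisect.bisect_left(free, (length, -1))
        if pos < len(free):
            remaining, b = free.pop(pos)  # O(N) shift in a plain list
        else:
            remaining, b = seq_len, len(bins)  # nothing fits: open a new bin
            bins.append([])
        bins[b].append(i)
        remaining -= length
        if remaining > 0:
            # O(log N) search but O(N) shift; this is the cost the PR text describes.
            bisect.insort(free, (remaining, b))
    return bins


if __name__ == "__main__":
    print(pack_bfd([5, 3, 4, 2, 6], seq_len=8))
```

Swapping the plain list `free` for a balanced or chunked sorted container is what brings each insertion down to O(log N), as described above.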

To correctly adapt the pipeline for pre-training, the following changes were also introduced:

Support for mixed data regimes: the pipeline now handles both SFT (using ApertusSFTDataset) and raw pre-training data (using GPTDataset). For now, SFT data must contain apertus_sft in the dataset path; this will be cleaned up in a future PR.
Adaptation of the BFD algorithm to the GPTDataset class.
Chunking instead of truncation in pre-training: documents longer than seq_len are split into fixed-length chunks prior to packing.
max_docs_per_bin setting: for longer sequences, we want to avoid concatenating too many small samples into a single sequence. This argument limits the number of documents packed into each bin during pre-training.
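The chunking and per-bin document cap listed above can be sketched as follows. Both helpers (`chunk_document`, `fits`) are hypothetical names for illustration, not functions from this PR, and the sketch assumes that the final chunk of a document may be shorter than seq_len and is then packed like any other short document.

```python
def chunk_document(tokens, seq_len):
    """Split a token list into consecutive chunks of at most seq_len tokens."""
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    # Every chunk except possibly the last has exactly seq_len tokens.
    return [tokens[start:start + seq_len] for start in range(0, len(tokens), seq_len)]


def fits(bin_tokens, bin_docs, doc_len, seq_len, max_docs_per_bin=None):
    """Check whether a document can join a bin under both limits.

    bin_tokens: tokens already in the bin; bin_docs: documents already in the bin.
    """
    if bin_tokens + doc_len > seq_len:
        return False  # token budget exceeded
    # Assumption: None means "no cap on the number of documents per bin".
    return max_docs_per_bin is None or bin_docs < max_docs_per_bin


if __name__ == "__main__":
    print(chunk_document(list(range(10)), seq_len=4))
    print(fits(bin_tokens=6, bin_docs=2, doc_len=2, seq_len=8, max_docs_per_bin=2))
```

With a cap in place, a bin that still has token capacity is treated as full once it holds max_docs_per_bin documents, which bounds the number of document boundaries per sequence.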

In the SFT case, we also incorporate the possibility of predicting LONG CONTEXT TEXT following this format for a specific long-context task:

<s>LONG CONTEXT TEXT<|system_start|><|system_end|>
<|developer_start|>Deliberation: disabled
Tool Capabilities: disabled<|developer_end|>
<|user_start|>How many times does the word "potato" appear?<|user_end|>
<|assistant_start|>The word potato appears 1234 times<|assistant_end|>

Comments from the Previous PR

Attention

As already discussed with @RaphaelKreft, if we have a very long document that is chunked into parts:

<s> PART 1 | PART 2 | PART 3 </s>

Each part already fills the entire sample because its size equals seq_len, so there is no need to add BOS/EOS tokens, given that the sample will contain no additional documents.

Bugs Addressed (see previous PR)

Loss Masking 1: Large User Inputs Could Cause Predicting User Queries

In the cooldown regime, we will not have these samples, but the current PR addresses the challenge of large user inputs by truncating the samples, as was previously implemented in the ApertusSFTDataset sampler.
IMO, we should pre-filter them to avoid wasting compute.

Loss Masking 2: Loss Issues with Goldfish and Multimodality

This has been solved by separating the two dataset types (pre-training/SFT).

Arguments Added

To launch SFT + pre-training, you will need to add these arguments to your launcher script:

--ap-sft
--ap-sft-mask-special-tokens
--ap-sft-pack-samples
--ap-sft-packing-strategy bfd
--pretraining-packing-strategy bfd
--max-docs-per-bin 64

Tests Done

Test 1: Pre-training

Comparison between pre-training in multimodality/main, greedy pre-training in this branch, and BFD pre-training in this branch.

Conclusions:

Greedy pre-training between multimodality/main and this branch is equivalent.
Best-fit starts at a higher loss but quickly recovers.
max_docs_per_bin in the best-fit approach helps significantly to preserve throughput.

Test 2: SFT

Comparison between SFT in multimodality/sft-feat-bestfit-packing and this branch. I trained both models on a retrieval task that asks the model to retrieve information from a dictionary. Given the C++ implementation, there are some differences after some samples, but the overall trend is similar.

Test 3: Full Pre-training + SFT

Evaluating this new framework with pre-training + SFT data gives good results in both short-context and long-context tasks.


RaphaelKreft

Overall looks correct.

I added some comments on minor / medium issues that are mostly not related to correctness, but rather to style and broken upstream code.

Also, the initialize_sft_dataset script is broken due to the new way datasets are distinguished by type (SFT vs. pretrain).

RaphaelKreft · 2026-05-04T11:51:17Z

            List[Optional[MidLevelDataset]]: The MidLevelDataset (or None) per split
        """
+
+        if "apertus_sft" in dataset_path:


We should do this in a cleverer way. I would suggest:

Besides the dataset path and an optional blend ratio, we add another optional argument for the dataset type.
For example, for two data paths:

Old: --data-path 0.1 [PATH1] 0.9 [PATH2]

New Option 1: a separate argument per dataset path: --data-path SFT 0.1 [PATH1] GPT 0.9 [PATH2]

New Option 2: a prefix on the dataset path: --data-path 0.1 SFT:[PATH1] 0.9 GPT:[PATH2]

We can choose arbitrary type handles for SFT (ApertusSFTDataset) or GPT (GPTDataset for pre-training data). Option 2 has the benefit of being easier to implement, with fewer modifications necessary.

To modify:

For Option 1: in megatron/core/datasets/utils.py, in get_blend_from_list, add a third branch for a list of triples (dataset type, blend ratio, and actual path).

For Option 2: add minimal parsing of the dataset path (split on : etc.) in megatron/core/datasets/blended_megatron_dataset_builder.py.
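To make Option 2 concrete, here is a minimal sketch of what the prefix parsing could look like. It is only an illustration: `parse_typed_data_path` is a hypothetical helper, not existing Megatron-LM code, and the fallback to GPT for unprefixed paths is an assumption chosen to keep the old --data-path syntax working.

```python
def parse_typed_data_path(entries, default_type="GPT"):
    """Parse ['0.1', 'SFT:/a', '0.9', 'GPT:/b'] into (weight, type, path) triples."""
    known_types = {"SFT", "GPT"}  # type handles taken from the suggestion above
    if len(entries) % 2 != 0:
        raise ValueError("expected alternating weight / path entries")
    parsed = []
    for weight, spec in zip(entries[0::2], entries[1::2]):
        prefix, sep, rest = spec.partition(":")
        if sep and prefix in known_types:
            dataset_type, path = prefix, rest
        else:
            # No recognised prefix (this also covers paths such as 's3://...').
            dataset_type, path = default_type, spec
        parsed.append((float(weight), dataset_type, path))
    return parsed


if __name__ == "__main__":
    print(parse_typed_data_path(["0.1", "SFT:/data/a", "0.9", "GPT:/data/b"]))
```

Restricting the split to a fixed set of known handles avoids misreading a colon that is part of the path itself.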

Of course, we have to document it in the README as well.

Another issue with this is that upstream features are broken this way: MockGPTDataset, GPTFIMDataset, and SFTDataset (JSONL) can no longer be selected through this builder. --mock-data, --fim-data, and --sft are no-ops for class selection. The same holds in pretrain_gpt.py after the deletion of the dispatch block.

RaphaelKreft · 2026-05-04T14:27:10Z

This file is a near duplicate of megatron/core/datasets/bfd_pack.cpp. The only real differences I can see are how zero-length documents are treated and the max_docs_per_bin argument.

@dtamayo-nlp Do you think you could consolidate them, or is there another reason for having separate files here that I missed?

RaphaelKreft · 2026-05-04T15:09:55Z


    config = core_gpt_dataset_config_from_args(args)

-    if args.ap_sft:


See comment on blended_megatron_dataset_builder.py regarding dataset selection logic

RaphaelKreft · 2026-05-04T15:16:34Z

            if self.sft_plw_value > 0:
                loss_mask[~assistant_mask] = self.sft_plw_value # value is 0 by default for full masking

+            # Also unmask tokens between BOS and <|system_start|> (pre-system content). Relevant for Long Context Tasks


For optimal performance, this should be toggled via a flag passed as an argument. That improves performance for non-long-context data.
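A flag-gated version of this masking could look roughly like the sketch below. It works on plain token strings rather than tensors, and both the function name `build_loss_mask` and the flag `unmask_pre_system` are hypothetical. It also assumes that the assistant start/end markers themselves stay masked, which may differ from the actual dataset code.

```python
def build_loss_mask(tokens, unmask_pre_system=False):
    """Return 1.0 for tokens that contribute to the loss and 0.0 otherwise."""
    mask = [0.0] * len(tokens)
    in_assistant = False
    for i, tok in enumerate(tokens):
        if tok == "<|assistant_start|>":
            in_assistant = True
        elif tok == "<|assistant_end|>":
            in_assistant = False
        elif in_assistant:
            mask[i] = 1.0  # train on the assistant reply only
    # Optional extra pass, skipped entirely unless the flag is set.
    if unmask_pre_system and "<|system_start|>" in tokens:
        system_pos = tokens.index("<|system_start|>")
        start = 1 if tokens and tokens[0] == "<s>" else 0  # leave BOS masked
        for i in range(start, system_pos):
            mask[i] = 1.0  # long-context text between BOS and <|system_start|>
    return mask


if __name__ == "__main__":
    sample = ["<s>", "ctx1", "ctx2", "<|system_start|>", "<|system_end|>",
              "<|user_start|>", "q", "<|user_end|>",
              "<|assistant_start|>", "a1", "a2", "<|assistant_end|>"]
    print(build_loss_mask(sample))
    print(build_loss_mask(sample, unmask_pre_system=True))
```

With the flag off, samples without pre-system content skip the extra scan, which is the saving the comment refers to.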

dtamayo-nlp added 6 commits April 23, 2026 22:42

Feat: Allow Pre-Training + SFT separating the classes

6b1e051

Feat: Add C++ Implementation to SFT

3b2c3f2

Feat: Introduce BFD in pre-training

de0e0a4

Fix: Optimize throughput for large node counts by implementing max_num_docs per bin

faab1e8

Feat: Allow loss_mask between <s> and <|system_prompt|>, useful for CWE task in long context

fee5b30

clean: Introduce arguments and clean implementation

3007afb

dtamayo-nlp assigned RaphaelKreft and Alvorecer721 and unassigned Alvorecer721 and RaphaelKreft Apr 24, 2026

dtamayo-nlp requested review from Alvorecer721 and RaphaelKreft April 24, 2026 16:11

dtamayo-nlp mentioned this pull request Apr 24, 2026

feat: Update SFT dataset for best-fit pretraining #111

Closed

Fix: Add Final Argument

ad6b2d6

dtamayo-nlp force-pushed the multimodality/sft-bestfit-sep-classes branch from b0981d8 to ad6b2d6 Compare April 24, 2026 21:12

RaphaelKreft reviewed May 4, 2026


dtamayo-nlp and others added 3 commits May 8, 2026 11:21

Fix: Add sequence parallelism support

87099b1

Fix packing statistics logging for apertus_sft_dataset.py

07ae2df

Add SFT packing preflight tool

c2e8388

dtamayo-nlp wants to merge 10 commits into multimodality/sft-feat-bestfit-packing from multimodality/sft-bestfit-sep-classes


3 participants

