Multimodality/sft bestfit sep classes#113

Open
dtamayo-nlp wants to merge 10 commits into multimodality/sft-feat-bestfit-packing from multimodality/sft-bestfit-sep-classes

@dtamayo-nlp
Collaborator

Motivation

The previous implementation worked correctly for SFT, but did not scale to pre-training workloads and eventually led to NCCL timeouts in tests with 4,000 files and billions of tokens.
This PR corrects issues raised in an older PR.

Features Introduced

The main bottleneck was the packing stage:

  • bisect.insort() incurs O(N) insertion cost, making the overall complexity O(N log N + N×B), which degrades to ~O(N^2) for large corpora (N = number of documents, B = number of bins). Replacing it with SortedList reduced insertion to O(log N) and improved the overall complexity to O(N log N), but Python overhead remained a bottleneck at scale.
    To fully address this, the default implementation is now C++ (the Python implementation is still available).
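
The cost argument above can be illustrated with a minimal pure-Python sketch of best-fit decreasing. This is not the implementation in this PR: the function name `best_fit_decreasing` and the `(remaining, bin_id)` tuple layout are assumptions made for illustration, and the sketch assumes every document already fits in `seq_len`. The open bins live in a plain sorted list, so each `insort` and `pop` shifts up to B elements; replacing that list with a `SortedList` is what would bring each update down to logarithmic cost.

```python
import bisect


def best_fit_decreasing(doc_lengths, seq_len):
    """Pack document lengths into bins of capacity seq_len (best-fit decreasing).

    Returns a list of bins, each a list of document indices.
    Assumes every length is in 1..seq_len.
    """
    # Visit documents from longest to shortest: the "decreasing" part.
    order = sorted(range(len(doc_lengths)), key=lambda i: doc_lengths[i], reverse=True)
    bins = []       # bins[b] = list of document indices placed in bin b
    open_bins = []  # sorted list of (remaining_capacity, bin_id)
    for i in order:
        length = doc_lengths[i]
        # Best fit: the open bin with the smallest remaining capacity >= length.
        pos = bisect.bisect_left(open_bins, (length, -1))
        if pos < len(open_bins):
            remaining, b = open_bins.pop(pos)  # list shift, linear in open bins
        else:
            remaining, b = seq_len, len(bins)  # nothing fits: open a new bin
            bins.append([])
        bins[b].append(i)
        remaining -= length
        if remaining > 0:
            # Binary search is logarithmic, but the list insert is linear.
            bisect.insort(open_bins, (remaining, b))
    return bins
```

For example, `best_fit_decreasing([5, 3, 4, 2, 6], seq_len=8)` places the 6-token and 2-token documents together, the 5-token and 3-token documents together, and leaves the 4-token document alone.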

To correctly adapt the pipeline for pre-training, the following changes were also introduced:

  • Support for mixed data regimes: the pipeline now handles both SFT (using ApertusSFTDataset) and raw pre-training data (using GPTDataset). For now, SFT data must contain apertus_sft in the dataset path; this will be cleaned up in a future PR.
  • Adaptation of the BFD algorithm to the GPTDataset class.
  • Chunking instead of truncation in pre-training: documents longer than seq_len are split into fixed-length chunks prior to packing.
  • max_docs_per_bin setting: for longer sequences, we want to avoid concatenating too many small samples into a single sequence. This argument limits the number of documents packed into one sequence during pre-training.
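
The chunking and per-bin document cap described above can be sketched as two small helpers. Both names (`chunk_document`, `bin_accepts`) are hypothetical and not taken from the PR. The sketch also assumes that a final chunk shorter than `seq_len` is kept; the description does not say how a trailing remainder is handled.

```python
def chunk_document(tokens, seq_len):
    """Split a token list into consecutive chunks of at most seq_len tokens.

    Every chunk except possibly the last has exactly seq_len tokens.
    Keeping the shorter final chunk is an assumption of this sketch.
    """
    return [tokens[i:i + seq_len] for i in range(0, len(tokens), seq_len)]


def bin_accepts(remaining, num_docs, doc_len, max_docs_per_bin=None):
    """Return True if a document of doc_len tokens may be added to a bin.

    remaining: free token slots left in the bin.
    num_docs: number of documents already placed in the bin.
    max_docs_per_bin: optional cap on documents per bin (None = no cap).
    """
    if doc_len > remaining:
        return False
    if max_docs_per_bin is not None and num_docs >= max_docs_per_bin:
        return False
    return True
```

With a cap of 64, a bin that already holds 64 documents is treated as full even if it still has free token slots, which is what keeps very long sequences from being filled with hundreds of tiny samples.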

In the SFT case, we also make it possible to predict (compute loss on) the LONG CONTEXT TEXT placed before the system turn, using the following format for a specific long-context task:

<s>LONG CONTEXT TEXT<|system_start|><|system_end|>
<|developer_start|>Deliberation: disabled
Tool Capabilities: disabled<|developer_end|>
<|user_start|>How many times does the word "potato" appear?<|user_end|>
<|assistant_start|>The word potato appears 1234 times<|assistant_end|>

Comments from the Previous PR

Attention

As already discussed with @RaphaelKreft, if we have a very long document that is chunked into parts:

<s> PART 1 | PART 2 | PART 3 </s>

Each part already fills an entire sample because its size equals seq_len, so there is no need to add BOS/EOS tokens to each part: no additional documents will share the sample.

Bugs Addressed (see previous PR)

Loss Masking 1: Large User Inputs Could Cause Predicting User Queries

In the cooldown regime, we will not have these samples, but the current PR addresses the challenge of large user inputs by truncating the samples, as was previously implemented in the ApertusSFTDataset sampler.
IMO, we should pre-filter them to avoid wasting compute.

Loss Masking 2: Loss Issues with Goldfish and Multimodality

This has been solved by separating the two dataset types (pre-training/SFT).

Arguments Added

To launch SFT + pre-training, you will need to add these arguments to your launcher script:

--ap-sft
--ap-sft-mask-special-tokens
--ap-sft-pack-samples
--ap-sft-packing-strategy bfd
--pretraining-packing-strategy bfd
--max-docs-per-bin 64

Tests Done

Test 1: Pre-training

Comparison between pre-training in multimodality/main, greedy pre-training in this branch, and BFD pre-training in this branch.

Conclusions:

  • Greedy pre-training between multimodality/main and this branch is equivalent.
  • Best-fit starts at a higher loss but quickly recovers.
  • Setting max_docs_per_bin in the best-fit approach helps significantly to preserve throughput.

Test 2: SFT

Comparison between SFT in multimodality/sft-feat-bestfit-packing and this branch. I trained both models on a retrieval task that asks the model to retrieve information from a dictionary. Because of the C++ implementation, the two runs differ after some samples, but the overall trend is similar.

Test 3: Full Pre-training + SFT

Evaluating this new framework with pre-training + SFT data gives good results on both short-context and long-context tasks.

@dtamayo-nlp force-pushed the multimodality/sft-bestfit-sep-classes branch from b0981d8 to ad6b2d6 on April 24, 2026, 21:12

@RaphaelKreft RaphaelKreft left a comment

Overall looks correct.

I added comments on some minor to medium issues. Most are not about correctness; they concern style and upstream code that this PR breaks.

In addition, the initialize_sft_dataset script is broken by the new way datasets are distinguished by type (SFT vs. pre-training).

List[Optional[MidLevelDataset]]: The MidLevelDataset (or None) per split
"""

if "apertus_sft" in dataset_path:

We should do it in a more clever way. I would suggest:

Besides the dataset path and an optional blend ratio, we add another optional argument for the dataset type.
For example, for two data paths:

  • Old: --data-path 0.1 [PATH1] 0.9 [PATH2]
  • New Option 1: separate arg per dataset path: --data-path SFT 0.1 [PATH1] GPT 0.9 [PATH2]
  • New Option 2: use a prefix in the dataset path: --data-path 0.1 SFT:[PATH1] 0.9 GPT:[PATH2]

We can choose arbitrary type handles for SFT (ApertusSFTDataset) and GPT (GPTDataset for pre-training data). Option 2 has the benefit of being easier to implement, since it requires fewer modifications.

To modify:

  • For Option 1: in megatron/core/datasets/utils.py, in get_blend_from_list, add a third branch for a list of triples (dataset type, blend ratio, and actual path)
  • For Option 2: add minimal parsing of the dataset path (split by ":", etc.) in megatron/core/datasets/blended_megatron_dataset_builder.py
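
As a rough sketch of what Option 2 could look like, the helpers below split a `TYPE:path` entry and turn an alternating weight/path list into triples. Everything here is hypothetical: the names `KNOWN_TYPES`, `split_typed_path`, and `parse_data_path`, the choice of `GPT` as the default type, and the triple layout are assumptions for illustration, not existing Megatron APIs.

```python
KNOWN_TYPES = {"SFT", "GPT"}  # assumed type handles, taken from the example above


def split_typed_path(entry, default_type="GPT"):
    """Split 'TYPE:path' into (type, path).

    Entries without a known prefix (including URIs such as 's3://...')
    are returned unchanged with default_type.
    """
    prefix, sep, rest = entry.partition(":")
    if sep and prefix in KNOWN_TYPES:
        return prefix, rest
    return default_type, entry


def parse_data_path(args_list):
    """Parse ['0.1', 'SFT:/a', '0.9', 'GPT:/b'] into (type, weight, path) triples."""
    if len(args_list) % 2 != 0:
        raise ValueError("expected alternating weight/path entries")
    triples = []
    for weight, entry in zip(args_list[0::2], args_list[1::2]):
        dataset_type, path = split_typed_path(entry)
        triples.append((dataset_type, float(weight), path))
    return triples
```

Checking the prefix against a fixed set of handles, rather than splitting on every colon, keeps paths that legitimately contain a colon working without a type prefix.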

Of course, we also have to document this in the README.

Another issue with this is that upstream features are broken: MockGPTDataset, GPTFIMDataset, and SFTDataset (JSONL) can no longer be selected through this builder. --mock-data, --fim-data, and --sft are no-ops for class selection. The same holds in pretrain_gpt.py after the deletion of the dispatch block.

This file is a near duplicate of megatron/core/datasets/bfd_pack.cpp. The only real differences I can see are how zero-length documents are treated and the max_docs_per_bin argument.

@dtamayo-nlp You think you could consolidate or is there another reason to have separate files here that I missed?

Comment thread pretrain_gpt.py

config = core_gpt_dataset_config_from_args(args)

if args.ap_sft:

See comment on blended_megatron_dataset_builder.py regarding dataset selection logic

if self.sft_plw_value > 0:
    loss_mask[~assistant_mask] = self.sft_plw_value  # value is 0 by default for full masking

# Also unmask tokens between BOS and <|system_start|> (pre-system content). Relevant for Long Context Tasks

For optimal performance, this should be toggled via a flag passed as an argument. Skipping this step improves performance for data that is not long-context.
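
A minimal sketch of such a toggle, using plain Python lists instead of tensors so that it runs without dependencies. The function name `build_loss_mask` and the parameter `unmask_pre_system` are hypothetical, and the sketch handles a single unpacked sample that starts with BOS; the actual dataset code operates on tensors and may differ.

```python
def build_loss_mask(token_ids, assistant_mask, bos_id, system_start_id,
                    plw_value=0.0, unmask_pre_system=False):
    """Return per-token loss weights for one SFT sample.

    assistant_mask[i] is True for tokens produced by the assistant.
    Non-assistant tokens get plw_value (0.0 means fully masked).
    If unmask_pre_system is True, tokens strictly between BOS and the
    first system-start token also get weight 1.0.
    """
    mask = [1.0 if is_assistant else plw_value for is_assistant in assistant_mask]
    if unmask_pre_system and token_ids and token_ids[0] == bos_id:
        try:
            end = token_ids.index(system_start_id)
        except ValueError:
            return mask  # no system turn found: leave the mask unchanged
        for i in range(1, end):
            mask[i] = 1.0
    return mask
```

With the flag off, the scan for the system-start token is skipped entirely, which is the saving the comment above asks for on data without pre-system content.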


3 participants