Multimodality/sft bestfit sep classes#113
Conversation
…WE task in long context
b0981d8 to
ad6b2d6
Compare
RaphaelKreft
left a comment
There was a problem hiding this comment.
Overall looks correct.
I added some comments to minor / medium issues that are mainly not related to correctness. Rather some style and broken upstream code.
Anyway, the initialize_sft_dataset script is broken due to new way datasets are distinguished by type (sft vs pretrain)
| List[Optional[MidLevelDataset]]: The MidLevelDataset (or None) per split | ||
| """ | ||
|
|
||
| if "apertus_sft" in dataset_path: |
There was a problem hiding this comment.
We should do it in a more clever way. I would suggest:
Besides the dataset path and an optional blend ration, we add another optional arg for the dataset-type
For example for two data paths:
- Old:
--data-path 0.1 [PATH1] 0.9 [PATH2] - New Option 1: separate arg per dset path
--data-path **SFT** 0.1 [PATH1] **GPT** 0.9 [PATH2] - New Option 2: use prefix in dset path
--data-path 0.1 SFT:[PATH1] 0.9 GPT:[PATH2]
We can choose arbitrary type handles for SFT (Apertus SFT Dataset) or GPT (GPTDataset for pretrain data). Option 2 has the benefit of being easier to implement / less modifications necessary.
To modify:
- **For Option1: ** in megatron/core/datasets/utils.py - get_blend_from_list: add third branch for list of triples (dset-type, blend ratio and actual path)
- For Option 2: Add minimal parsing of datasetpath (split by : etc) in
megatron/core/datasets/blended_megatron_dataset_builder.py
There was a problem hiding this comment.
of course we have to do document it in the ReadMe as well
There was a problem hiding this comment.
Another issue with thi is, that features from upstream are broken this way: MockGPTDataset, GPTFIMDataset, SFTDataset (JSONL) can no longer be selected through this builder. --mock-data, --fim-data, --sft are no-ops for class selection. Same in pretrain_gpt.py after the deletion of the dispatch block.
There was a problem hiding this comment.
This file is a near duplicate of megatron/core/datasets/bfd_pack.cpp. Only real difference I can see is regarding how the zero length documents are treated and the max_docs_per_bin argument.
@dtamayo-nlp You think you could consolidate or is there another reason to have separate files here that I missed?
|
|
||
| config = core_gpt_dataset_config_from_args(args) | ||
|
|
||
| if args.ap_sft: |
There was a problem hiding this comment.
See comment on blended_megatron_dataset_builder.py regarding dataset selection logic
| if self.sft_plw_value > 0: | ||
| loss_mask[~assistant_mask] = self.sft_plw_value # value is 0 by default for full masking | ||
|
|
||
| # Also unmask tokens between BOS and <|system_start|> (pre-system content). Relevant for Long Context Tasks |
There was a problem hiding this comment.
for optimal performance this should be toggled via a flag as argument. Improves performance in case of non-long context data
Motivation
The previous implementation worked correctly for SFT, but did not scale to pre-training workloads and eventually led to NCCL timeouts in tests with 4,000 files and billions of tokens.
This PR corrects issues raised in an older PR.
Features Introduced
The main bottleneck was the packing stage:
To fully address this, the default implementation was changed to C++ (python implementation is still available).
To correctly adapt the pipeline for pre-training, the following changes were also introduced:
ApertusSFTDataset) and raw pre-training data (usingGPTDataset). SFT data must containapertus_sftin the name, in a future PR, this will be cleaned up.GPTDatasetclass.seq_lenare split into fixed-length chunks prior to packing.max_doc_per_binsetting: for longer sequences, we want to avoid having too many small samples concatenated into a single sequence. This argument limits the number of documents packed during pre-training.In the SFT case, we also incorporate the possibility of predicting LONG CONTEXT TEXT following this format for a specific long-context task:
Comments from the Previous PR
Attention
As already discussed with @RaphaelKreft, if we have a very long document that is chunked into parts:
Each part already fills the entire sample because its size equals
seq_len, there is no need to add BOS/EOS tokens, given that there will be no additional documents.Bugs Addressed (see previous PR)
Loss Masking 1: Large User Inputs Could Cause Predicting User Queries
In the cooldown regime, we will not have these samples, but the current PR addresses the challenge of large user inputs by truncating the samples, as was previously implemented in the ApertusSFTDataset sampler.
IMO, we should pre-filter them to avoid wasting compute.
Loss Masking 2: Loss Issues with Goldfish and Multimodality
This has been solved by separating the two dataset types (pre-training/SFT).
Arguments Added
To launch SFT + pre-training, you will need to add these arguments to your launcher script:
Tests Done
Test 1: Pre-training
Comparison between pre-training in
multimodality/main, greedy pre-training in this branch, and BFD pre-training in this branch.Conclusions:
multimodality/mainand this branch is equivalent.max_doc_per_binin the best-fit approach helps significantly preserve throughput.Test 2: SFT
Comparison between SFT in

multimodality/sft-feat-bestfit-packingand this branch. I trained both models on a retrieval task that asks the model to retrieve information from a dictionary. Given the C++ implementation, there are some differences after some samples, but the overall trend is similar.Test 3: Full Pre-training + SFT
Evaluating this new framework with pre-training + SFT data gives good results in both, short context and long context tasks.